<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patima Poochai</title>
    <description>The latest articles on DEV Community by Patima Poochai (@patimapoochai).</description>
    <link>https://dev.to/patimapoochai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2597891%2F4d4ff6e6-fdc2-4fa7-94ef-79f93c94bece.png</url>
      <title>DEV Community: Patima Poochai</title>
      <link>https://dev.to/patimapoochai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patimapoochai"/>
    <language>en</language>
    <item>
      <title>How To Resize LVM Volumes Dynamically Using Cloud-Init (Step-By-Step)</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Thu, 26 Mar 2026 08:56:42 +0000</pubDate>
      <link>https://dev.to/patimapoochai/how-to-resize-lvm-volumes-dynamically-in-cloud-init-vms-step-by-step-1oml</link>
      <guid>https://dev.to/patimapoochai/how-to-resize-lvm-volumes-dynamically-in-cloud-init-vms-step-by-step-1oml</guid>
      <description>&lt;p&gt;Logical Volume Manager (LVM) is the preferred storage framework for Linux VMs. It's easy to resize LVM volumes, migrate data between devices, and combine multiple physical disks into one easy-to-mange volume. These features are especially useful in a virtualized environment, as they enable you to do these tasks without physical access to the machine.&lt;/p&gt;

&lt;p&gt;When creating VM templates, it's common to create a fixed-size volume for the root partition and &lt;strong&gt;use cloud-init to grow the partition size dynamically when cloning the template&lt;/strong&gt; (unless you want to &lt;em&gt;create multiple templates&lt;/em&gt; with each one having 10G, 11G, 12G... storage and so on). However, you can't simply use the built-in growpart module to resize logical volumes. It &lt;a href="https://forum.proxmox.com/threads/cloud-init-lvm-resize-not-working.68947/" rel="noopener noreferrer"&gt;only works on partitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But that doesn't mean you should avoid using LVM in your templates. After a few trials and tribulations, I've found a workaround that allows cloud-init to automatically resize logical volumes, as well as a few pitfalls you should avoid when using cloud-init with Debian machines. Here's what I've learned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur32gebhokmal55curod.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur32gebhokmal55curod.gif" alt="Animated GIF depicting a fight between Debian and LVM as the fight between the Boss and Snake from Metal Gear Solid 3." width="450" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Steps to Grow LVM Volumes with Cloud-Init
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaspg3ugkp3vo05s7kcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaspg3ugkp3vo05s7kcc.png" alt="The partition configuration of the Linux VM. The third partition has 30 GB of storage." width="428" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the scenario. There are three partitions on my VM template, with &lt;code&gt;/dev/sda3&lt;/code&gt; containing the root logical volume (LV) named &lt;code&gt;ubuntu--vg-ubuntu--lv&lt;/code&gt;. The VM's disk originally had 30 GB of storage, but I've resized the drive to 70 GB. The root LV is still 30 GB, so I need cloud-init to automatically resize the volume to fill the remaining space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grow the Partition
&lt;/h3&gt;

&lt;p&gt;First, we need to &lt;strong&gt;grow the partition&lt;/strong&gt; using the &lt;a href="https://docs.cloud-init.io/en/latest/reference/modules.html#growpart" rel="noopener noreferrer"&gt;growpart&lt;/a&gt; module. While we can't use growpart directly on the root LV, we can use it to grow the partition that contains the underlying physical volume (PV) of the root volume.&lt;/p&gt;

&lt;p&gt;Create a cloud-init configuration in &lt;code&gt;/etc/cloud/cloud.cfg.d/90-LVM.cfg&lt;/code&gt;, and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;growpart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;/dev/sda3&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test the config, clear cloud-init's state so it runs again on the next boot and reboot the machine (the &lt;code&gt;-r&lt;/code&gt; flag reboots after cleaning):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;cloud-init clean &lt;span class="nt"&gt;-r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how the partitions look after growpart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk94jaijq276ag6b3ssps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk94jaijq276ag6b3ssps.png" alt="The partitions of the VM, now with the third partition having 68 GB." width="425" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Growpart expanded &lt;code&gt;/dev/sda3&lt;/code&gt; to 68 GB using the newly added free space in the storage disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sybz46otn4igfvansb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sybz46otn4igfvansb8.png" alt="PVS command showing that the PV of the root volume was expanded." width="380" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the &lt;code&gt;sda3&lt;/code&gt; partition is mapped to the physical volume backing the root volume group, growing the partition also expanded the underlying PV. We can now expand the LV to fill the free space in the volume group.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grow the Root Volume
&lt;/h3&gt;

&lt;p&gt;Second, we need to &lt;strong&gt;grow the logical volume&lt;/strong&gt;. There is no built-in cloud-init module to manage LVM volumes, but we can use the &lt;a href="https://docs.cloud-init.io/en/latest/reference/modules.html#runcmd" rel="noopener noreferrer"&gt;runcmd&lt;/a&gt; module to execute the shell commands to resize the logical volume.&lt;/p&gt;

&lt;p&gt;One caveat is that we have to make sure runcmd only triggers after the growpart module. Otherwise, we would be expanding the LV before the PV has grown. We can check the module execution order by looking at the default cloud-init config at &lt;code&gt;/etc/cloud/cloud.cfg&lt;/code&gt; and making sure that &lt;code&gt;runcmd&lt;/code&gt; is executed after &lt;code&gt;growpart&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favsjrtpee3wbjcsevtzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favsjrtpee3wbjcsevtzu.png" alt="Snippet of the boot stages and their modules." width="351" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modules in the "config" stage will always run after the "init" stage, so make sure that runcmd is under &lt;code&gt;cloud_config_modules&lt;/code&gt;.&lt;/p&gt;
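&lt;p&gt;For reference, the relevant part of a stock &lt;code&gt;/etc/cloud/cloud.cfg&lt;/code&gt; looks roughly like this. This is an abridged sketch; the exact module lists vary by distro and cloud-init version:&lt;/p&gt;

```yaml
# Abridged sketch of /etc/cloud/cloud.cfg; actual module lists vary by version.
cloud_init_modules:        # "init" stage
  - growpart               # grows the partition first
  # ...
cloud_config_modules:      # "config" stage, always after "init"
  - runcmd                 # queues our lvresize command
  # ...
cloud_final_modules:       # "final" stage
  - scripts-user           # actually executes what runcmd queued
  # ...
```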

&lt;blockquote&gt;
&lt;p&gt;Fun fact: runcmd actually &lt;a href="https://docs.cloud-init.io/en/latest/reference/modules.html#runcmd" rel="noopener noreferrer"&gt;defers the execution until the scripts_user module in the "final" boot stage&lt;/a&gt;, so make sure that &lt;code&gt;scripts_user&lt;/code&gt; is under &lt;code&gt;cloud_final_modules&lt;/code&gt; as well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat6xotmjok3hfp03nsj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat6xotmjok3hfp03nsj7.png" alt="Excerpt from the official documentation describing how the runcmd module runs its script in the final boot stage." width="800" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we can add a runcmd entry to &lt;code&gt;/etc/cloud/cloud.cfg.d/90-LVM.cfg&lt;/code&gt; that resizes the root LV. In my case, the root path (&lt;code&gt;/&lt;/code&gt;) of my VM is mounted on a logical volume named &lt;code&gt;ubuntu-lv&lt;/code&gt;, which lives in a volume group (VG) named &lt;code&gt;ubuntu-vg&lt;/code&gt;, so my runcmd looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# append to /etc/cloud/cloud.cfg.d/90-LVM.cfg&lt;/span&gt;
&lt;span class="na"&gt;runcmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;lvresize&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;-l&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;+100%FREE&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;-r&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;/dev/ubuntu-vg/ubuntu-lv&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command resizes the volume using all of the remaining space (&lt;code&gt;+100%FREE&lt;/code&gt;) while also resizing the file system in the volume (&lt;code&gt;-r&lt;/code&gt;).&lt;/p&gt;
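&lt;p&gt;Putting both steps together, the complete &lt;code&gt;/etc/cloud/cloud.cfg.d/90-LVM.cfg&lt;/code&gt; from this walkthrough is only a few lines. Substitute your own partition, VG, and LV names:&lt;/p&gt;

```yaml
# /etc/cloud/cloud.cfg.d/90-LVM.cfg
# Grow the partition holding the PV, then grow the LV and its filesystem.
growpart:
  devices: [/dev/sda3]

runcmd:
  - [lvresize, -l, +100%FREE, -r, /dev/ubuntu-vg/ubuntu-lv]
```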

&lt;p&gt;Test the config again by rebooting the machine and rerunning cloud-init:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;cloud-init clean &lt;span class="nt"&gt;-r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon reboot, cloud-init should resize the root LV to use the remaining free space in the storage disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y23l5tts0ig7x8n459p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y23l5tts0ig7x8n459p.png" alt="Partitions of the VM, now the root volume is using all of the available space." width="452" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting
&lt;/h3&gt;

&lt;p&gt;If cloud-init isn't resizing the volumes as expected, your first troubleshooting step should be checking the logs at &lt;code&gt;/var/log/cloud-init.log&lt;/code&gt;.&lt;/p&gt;
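&lt;p&gt;The log is long, so filtering it for warning and error lines gets you to the failure faster. Here's a sketch; the &lt;code&gt;printf&lt;/code&gt; lines stand in for the real log, so on a VM pipe in &lt;code&gt;cat /var/log/cloud-init.log&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Surface problem lines from cloud-init's log. The two sample lines below
# stand in for the real contents of /var/log/cloud-init.log.
printf '%s\n' \
  'util.py[DEBUG]: Running module growpart' \
  "TypeError: Input to shellify was type 'str'. expected list or tuple" \
  | grep -iE 'warn|error|traceback'
```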

&lt;p&gt;For example, I had some confusion regarding the possible values for the runcmd module when I started working on this project. Look at this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq98uut02cwybbdpy5frv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq98uut02cwybbdpy5frv.png" alt="Snippet of the runcmd module with its schema." width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Does this mean the module accepts 1) an array whose items are each an array of strings, a string, or null, or 2) either an array &lt;em&gt;or&lt;/em&gt; just a plain string? To get a practical understanding of the schema, I first tried using a simple string for the module.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aawnh4l0z51oerc5ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aawnh4l0z51oerc5ou.png" alt="A snippet of the runcmd module with only a string as its input." width="417" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a reboot, cloud-init ran the configuration, but the LV was still the same size. Take a look at the logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mui4hbywmcehq4z4426.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mui4hbywmcehq4z4426.png" alt="Snippet of the cloud-init logs. One of the errors shows that the runcmd module did not expect a string as its input." width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the &lt;code&gt;TypeError: Input to shellify was type 'str'. expected list or tuple&lt;/code&gt; line. This line tells us that the module expects the inputs to look like this in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# this is correct&lt;/span&gt;
&lt;span class="s"&gt;Array [&lt;/span&gt;
  &lt;span class="s"&gt;- StringArray ["a", "b"]&lt;/span&gt;
  &lt;span class="s"&gt;- String "abc"&lt;/span&gt;
  &lt;span class="s"&gt;- Null &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="err"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# not this&lt;/span&gt;
&lt;span class="s"&gt;Array [] || String "abc"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
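&lt;p&gt;Translated into actual cloud-config YAML, the accepted and rejected shapes look like this (the commands are just placeholders):&lt;/p&gt;

```yaml
# Valid: runcmd's value is a list; each item is an argv-style list or a string.
runcmd:
  - [lvresize, -l, +100%FREE, -r, /dev/ubuntu-vg/ubuntu-lv]
  - echo "a plain string is fine as a list item"

# Invalid: a bare string as the module's value raises the shellify TypeError.
# runcmd: lvresize -l +100%FREE -r /dev/ubuntu-vg/ubuntu-lv
```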



&lt;p&gt;Niche issue? Probably. But if you need to write a more advanced runcmd config, reading &lt;code&gt;/var/log/cloud-init.log&lt;/code&gt; can help clarify some of the ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debian-specific Issues ("No free sectors")
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkb3v5d5oxhf5qywtmi5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkb3v5d5oxhf5qywtmi5c.png" alt="Comedic depiction of Debian and LVM as the Boss and Snake from the game Metal Gear Solid 3. Debian is injuring LVM." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The steps above should work for most cases. However, when using cloud-init and growpart to resize volumes in Debian VMs, you might run into the following error:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrxslp55j689xyyozoeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrxslp55j689xyyozoeh.png" alt="Snippet of the growpart error. The root volume is at sda5, and a line in this error shows that the module failed to resize partition sda3." width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the &lt;code&gt;/dev/sda3: No free sectors available&lt;/code&gt; line. Our root LV is located on &lt;code&gt;sda5&lt;/code&gt;, and growpart failed to resize that partition because it failed to resize &lt;code&gt;sda3&lt;/code&gt;. But... that partition &lt;em&gt;doesn't exist&lt;/em&gt;. Why is growpart trying to resize a non-existent partition?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06m010hrxvzno7y3ydqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06m010hrxvzno7y3ydqq.png" alt="Snippet of the growpart documentation." width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to the documentation, growpart will only resize the last partition on the disk, and "last" means last &lt;em&gt;numerically&lt;/em&gt;: the partition numbers cannot skip. If we want growpart to resize the partition holding our root LV, it has to be &lt;code&gt;sda3&lt;/code&gt; rather than &lt;code&gt;sda5&lt;/code&gt;. In other words, 1, 2, 3 and not 1, 2, 5.&lt;/p&gt;

&lt;p&gt;But why does our VM create the partitions in this way? It's because &lt;a href="https://wiki.debian.org/Partition#MBR_format_partitions"&gt;the MBR partitioning scheme puts LVM storage on a logical partition&lt;/a&gt;, and logical partitions always start at number 5, hence &lt;code&gt;sda5&lt;/code&gt;. A GPT partition scheme doesn't have this limitation, so we should change the partitioning scheme of our VM to use GPT instead.&lt;/p&gt;

&lt;p&gt;But how do you make Debian use GPT over MBR? By using UEFI. The Debian installer automatically decides the partitioning scheme &lt;a href="https://unix.stackexchange.com/questions/518377/where-does-the-debian-installer-choose-mbr-vs-gpt" rel="noopener noreferrer"&gt;based on whether you're using BIOS or UEFI&lt;/a&gt;. So if we want the partitions to be in order without skipping numbers, we have to complete the installation process with UEFI enabled.&lt;/p&gt;
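&lt;p&gt;If you're not sure which firmware an existing machine booted under, the kernel exposes it: the directory &lt;code&gt;/sys/firmware/efi&lt;/code&gt; only exists on UEFI boots. A quick check (a generic sketch, not specific to Proxmox):&lt;/p&gt;

```shell
# Prints UEFI when the machine booted with UEFI firmware, BIOS otherwise.
if [ -d /sys/firmware/efi ]; then echo UEFI; else echo BIOS; fi
```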

&lt;p&gt;Convoluted? Yes, but the fix is simple: &lt;strong&gt;use UEFI during installation&lt;/strong&gt;. In Proxmox, you can enable UEFI by changing the &lt;code&gt;BIOS&lt;/code&gt; option to &lt;code&gt;OVMF (UEFI)&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pjxgf2n5yfa77za4k5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pjxgf2n5yfa77za4k5y.png" alt="Snippet of the BIOS setting in Proxmox. The BIOS is set to OVMF (UEFI)." width="294" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then add an EFI disk:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwrv5kcz5wezlntajxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwrv5kcz5wezlntajxj.png" alt="Snippet of the hardware options in Proxmox. The " width="161" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then go through the installer as normal and choose &lt;code&gt;Guided - use entire disk and set up LVM&lt;/code&gt; during installation. After completing the installation, your partitions should now be in order:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k68sf1a3cv78l24dlvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k68sf1a3cv78l24dlvk.png" alt="Partitions of the VMs, the root volume is now at sda3." width="411" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the partitions of my Debian VM after using UEFI during installation. Note how the root LV is now at &lt;code&gt;/dev/sda3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I could now use growpart and runcmd to resize the root LV. This is my configuration for the Debian VM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq56htysicuwjix2tnwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq56htysicuwjix2tnwz.png" alt="The cloud-init config for the Debian VM. The config creates a default user called " width="397" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the result after rerunning cloud-init:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eyqi9enx0vc6dh13uj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eyqi9enx0vc6dh13uj4.png" alt="Partitions of the Debian VM, now the root volume is using all of the available space on the storage disk." width="427" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud-init resized the root LV without the &lt;code&gt;No free sectors&lt;/code&gt; issue.&lt;/p&gt;
&lt;h3&gt;
  
  
  Troubleshooting "status: disabled" and Unrecognized Cloud-init Drive Issues
&lt;/h3&gt;

&lt;p&gt;You might run into other issues with cloud-init while setting up a Debian VM. Here's a short walkthrough of how I troubleshot and resolved them.&lt;/p&gt;

&lt;p&gt;First, I got this error:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5vcnnil4844kulmyjzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5vcnnil4844kulmyjzj.png" alt="Snippet of the " width="385" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud-init wasn't running, and a status check told us that it had been disabled by cloud-init-generator, but not why.&lt;/p&gt;

&lt;p&gt;Maybe the source code of cloud-init-generator can tell us more about the cause of the issue. You can locate cloud-init-generator by listing the files installed by the cloud-init package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dpkg-query &lt;span class="nt"&gt;-L&lt;/span&gt; cloud-init | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command shows all the files that were installed with the cloud-init package.  Look for &lt;code&gt;cloud-init-generator&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr5mc509caeinr3t7n0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr5mc509caeinr3t7n0j.png" alt="Filtered result of the dpkg-query command. It shows that cloud-init-generator is located in the /usr/lib directory." width="451" height="30"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's in &lt;code&gt;/usr/lib/systemd/system-generators/cloud-init-generator&lt;/code&gt;. Here's the first few lines of the file:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl709mtuo511k2mncyoi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl709mtuo511k2mncyoi6.png" alt="Snippet of the system-generators file. One of the lines contains the location of the log file for this script." width="447" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the &lt;code&gt;LOG_F&lt;/code&gt; variable. That's the location of the log file where we can learn more about why cloud-init was disabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2peltpd8m2go7sc4aq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2peltpd8m2go7sc4aq.png" alt="Snippet of the log file. A line describes that the script ran but didn't find any data source." width="800" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud-init used the &lt;code&gt;ds-identify&lt;/code&gt; component to identify data sources, and it couldn't find any valid configuration sources. However, I had attached a cloud-init drive to the VM via the Proxmox GUI, so what's going on?&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Sources Formatting
&lt;/h3&gt;

&lt;p&gt;Let's check the logs related to ds-identify at &lt;code&gt;/run/cloud-init/ds-identify.log&lt;/code&gt; (also recommended by &lt;a href="https://docs.cloud-init.io/en/latest/howto/debugging.html#cloud-init-did-not-run" rel="noopener noreferrer"&gt;the documentation&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1mjpso0tept57o0ic1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1mjpso0tept57o0ic1g.png" alt="A line from the log file describing how the field datasource_list wasn't found." width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the &lt;code&gt;WARN: no datasource_list found&lt;/code&gt; message, it seems one of the problems is that ds-identify expects the config to define &lt;code&gt;datasource_list&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's how I changed the data sources configuration:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgycw79bmx19gjxv0n93g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgycw79bmx19gjxv0n93g.png" alt="Text snippet showing the wrong and right way to write the data sources section. The data sources list is on a single line with a list as the value." width="321" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the output of &lt;code&gt;/run/cloud-init/ds-identify.log&lt;/code&gt; after applying this change:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvgjw7gq7xt25ia5ldlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvgjw7gq7xt25ia5ldlv.png" alt="A line from the ds-identify log file. The script can now read the data sources list." width="594" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cloud-init now detects the sources list, but it still doesn't recognize the cloud-init device. I was puzzled by this for a while, until I came across the cause a few days later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud-init Drive Interface
&lt;/h3&gt;

&lt;p&gt;As of March 2026, according to &lt;a href="https://forum.proxmox.com/threads/cloud-init-drive-missing.174803/" rel="noopener noreferrer"&gt;this post&lt;/a&gt; and &lt;a href="https://forum.proxmox.com/threads/cicustom-cloud-init-stopped-working.164297/#post-759392" rel="noopener noreferrer"&gt;this post&lt;/a&gt;, there is a compatibility issue between IDE devices and OVMF. In practice, cloud-init drives that use IDE aren't recognized by the VM if you're using OVMF (UEFI). &lt;/p&gt;

&lt;p&gt;The fix is simple: &lt;strong&gt;use SCSI for your cloud-init drive&lt;/strong&gt;. When you're creating the cloud-init drive, choose the SCSI option in the Proxmox GUI:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs87qara09011588plpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs87qara09011588plpv.png" alt="Proxmox GUI for the CloudInit Drive setting. The drive is set to SCSI." width="293" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, here are the hardware options for my VM:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05fhp3qx17sjd9q5s1ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05fhp3qx17sjd9q5s1ir.png" alt="The hardware options in Proxmox. The cloud-init drive is set to IDE." width="498" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the list of block devices recognized by the VM. Note how, despite the VM having two &lt;code&gt;cdrom&lt;/code&gt; drives, one of which is the cloud-init drive, only the CD/DVD drive shows up in the list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq762jrmw1zhxuyxmf6wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq762jrmw1zhxuyxmf6wb.png" alt="Block devices list of the VM. Cloud-init drive is not shown." width="431" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I changed the cloud-init drive to use SCSI:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpd78g4ftxznkrm9v6ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpd78g4ftxznkrm9v6ph.png" alt="The hardware options in Proxmox. The cloud-init drive is set to SCSI." width="594" height="177"&gt;&lt;/a&gt;&lt;/p&gt;
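&lt;p&gt;If you prefer the command line, the same change can be made with &lt;code&gt;qm&lt;/code&gt;. The VMID (9000) and storage name (&lt;code&gt;local-lvm&lt;/code&gt;) here are placeholders; substitute your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Remove the old IDE cloud-init drive, then attach one on the SCSI bus
qm set 9000 --delete ide2
qm set 9000 --scsi1 local-lvm:cloudinit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--delete ide2&lt;/code&gt; call assumes the old drive sat on the &lt;code&gt;ide2&lt;/code&gt; slot, the GUI's usual default; check your VM's hardware tab first.&lt;/p&gt;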

&lt;p&gt;Here's the updated list of block devices. The VM now recognizes the cloud-init drive at &lt;code&gt;/dev/sr0&lt;/code&gt; and should execute your configuration on startup.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s0a5ng8lfx0jnktl800.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s0a5ng8lfx0jnktl800.png" alt="Block devices list of the VM. Cloud-init drive is on the list." width="438" height="138"&gt;&lt;/a&gt;&lt;/p&gt;
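&lt;p&gt;Inside the guest, you can double-check that cloud-init actually sees the drive. NoCloud config drives carry the &lt;code&gt;cidata&lt;/code&gt; filesystem label, so something like this should confirm it (exact output varies by distro):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;blkid --label cidata        # Prints the device holding the cloud-init data, e.g. /dev/sr0
cloud-init status --long    # Summarizes what cloud-init did on the last boot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;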

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Questions? Thoughts? Feel free to leave a comment!&lt;/p&gt;

&lt;p&gt;Need someone skilled in RHEL, Kubernetes, and AWS? I'm open to work! View my &lt;a href="https://patimapoochai.github.io/" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; and reach out via &lt;a href="//www.linkedin.com/in/patima-poochai808"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://infosec.exchange/@patlikestechnology" rel="noopener noreferrer"&gt;Mastodon&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>linux</category>
      <category>proxmox</category>
    </item>
    <item>
      <title>How to fix the "opening socket 'charon.vici' failed: Permission denied" issue in Proxmox</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Mon, 02 Mar 2026 21:55:58 +0000</pubDate>
      <link>https://dev.to/patimapoochai/how-to-fix-the-opening-socket-charonvici-failed-permission-denied-issue-in-proxmox-cd4</link>
      <guid>https://dev.to/patimapoochai/how-to-fix-the-opening-socket-charonvici-failed-permission-denied-issue-in-proxmox-cd4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj3ib71acv7ozulpavw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj3ib71acv7ozulpavw4.png" alt="Visual representation of the charon.vici issue. AppArmor is depicted as shooting strongSwan's IPsec security associations." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I'm working on setting up SDN zones and IPsec encryption for my Proxmox VE 9 machines. I followed the usual &lt;a href="https://pve.proxmox.com/pve-docs/chapter-pvesdn.html" rel="noopener noreferrer"&gt;guide&lt;/a&gt; on creating and encrypting SDN zones, installed the usual stuff like &lt;code&gt;frr-pythontools&lt;/code&gt; and &lt;code&gt;strongswan&lt;/code&gt;, but then ran into this issue when using &lt;code&gt;swanctl&lt;/code&gt; to check the status of strongSwan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apt/sources.list.d# swanctl &lt;span class="nt"&gt;--stats&lt;/span&gt;
plugin &lt;span class="s1"&gt;'test-vectors'&lt;/span&gt;: failed to load - test_vectors_plugin_create not found and no plugin file available
...
plugin &lt;span class="s1"&gt;'curl'&lt;/span&gt;: failed to load - curl_plugin_create not found and no plugin file available
opening socket &lt;span class="s1"&gt;'unix:///var/run/charon.vici'&lt;/span&gt; failed: Permission denied
Error: connecting to &lt;span class="s1"&gt;'default'&lt;/span&gt; URI failed: Permission denied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first, I thought this was a simple permissions issue, but the strongSwan service should be running as the root user. I checked the permissions of the socket at &lt;code&gt;/var/run/charon.vici&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apt/sources.list.d# &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /var/run/charon.vici
srwxrwx--- 1 root root 0 Feb 28 10:39 /var/run/charon.vici
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root user should have permission on this socket, yet the logs claim it doesn't. What is going on?&lt;/p&gt;

&lt;h2&gt;
  
  
  Systemd Service Troubleshooting
&lt;/h2&gt;

&lt;p&gt;My first instinct was to check the Systemd logs. &lt;code&gt;strongswan.service&lt;/code&gt; is the service that manages IPsec, and it's what &lt;code&gt;swanctl&lt;/code&gt; connects to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apt/sources.list.d# systemctl status strongswan
× strongswan.service - strongSwan IPsec IKEv1/IKEv2 daemon using swanctl
     Loaded: loaded &lt;span class="o"&gt;(&lt;/span&gt;/usr/lib/systemd/system/strongswan.service&lt;span class="p"&gt;;&lt;/span&gt; enabled&lt;span class="p"&gt;;&lt;/span&gt; preset: enabled&lt;span class="o"&gt;)&lt;/span&gt;
     Active: failed &lt;span class="o"&gt;(&lt;/span&gt;Result: exit-code&lt;span class="o"&gt;)&lt;/span&gt; since Sat 2026-02-28 10:39:09 HST&lt;span class="p"&gt;;&lt;/span&gt; 3h 55min ago
    ...
    Process: 21205 &lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/sbin/charon-systemd &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;exited, &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/SUCCESS&lt;span class="o"&gt;)&lt;/span&gt;
    Process: 21234 &lt;span class="nv"&gt;ExecStartPost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/sbin/swanctl &lt;span class="nt"&gt;--load-all&lt;/span&gt; &lt;span class="nt"&gt;--noprompt&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;exited, &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;13&lt;span class="o"&gt;)&lt;/span&gt;
   Main PID: 21205 &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;exited, &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/SUCCESS&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;StrongSwan is inactive; that's obvious. But one thing stands out: &lt;code&gt;charon-systemd&lt;/code&gt;, the process in the &lt;code&gt;ExecStart&lt;/code&gt; line, started up fine. The problem comes from &lt;code&gt;swanctl&lt;/code&gt;, which is invoked in &lt;code&gt;ExecStartPost&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Let's look at the logs for &lt;code&gt;strongswan.service&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apparmor.d# journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; strongswan.service 
Feb 28 10:39:09 pve systemd[1]: Starting strongswan.service - strongSwan IPsec IKEv1/IKEv2 daemon using swanctl...
Feb 28 10:39:09 pve charon-systemd[21205]: Starting charon-systemd IKE daemon &lt;span class="o"&gt;(&lt;/span&gt;strongSwan 6.0.1, Linux 6.17.2-1-pve, x86_64&lt;span class="o"&gt;)&lt;/span&gt;
...
Feb 28 10:39:09 pve charon-systemd[21205]: dropped capabilities, running as uid 0, gid 0
Feb 28 10:39:09 pve charon-systemd[21205]: spawning 16 worker threads
...
Feb 28 10:39:09 pve swanctl[21234]: opening socket &lt;span class="s1"&gt;'unix:///var/run/charon.vici'&lt;/span&gt; failed: Permission denied
Feb 28 10:39:09 pve swanctl[21234]: Error: connecting to &lt;span class="s1"&gt;'default'&lt;/span&gt; URI failed: Permission denied
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see here that the main component of strongSwan, &lt;code&gt;charon-systemd&lt;/code&gt;, starts perfectly fine. However, loading the configuration files with &lt;code&gt;swanctl&lt;/code&gt; produces the same error, and this error is what caused Systemd to label &lt;code&gt;strongswan.service&lt;/code&gt; as "failed" and stop the service.&lt;/p&gt;

&lt;p&gt;While the logs didn't reveal the root cause, they helped me narrow it down to a single binary. And if I can fix this issue in &lt;code&gt;swanctl&lt;/code&gt;, &lt;code&gt;strongswan.service&lt;/code&gt; should stop failing as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporary Solution From AppArmor Troubleshooting
&lt;/h2&gt;

&lt;p&gt;I can't recall where I first got the idea, but while troubleshooting I stumbled upon a forum post pointing out that this might be an AppArmor issue and recommending a check of the full journalctl logs for AppArmor denials. So I ran &lt;code&gt;journalctl&lt;/code&gt;, looked for any logs related to &lt;code&gt;swanctl&lt;/code&gt;, and found this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pve kernel: audit: &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1400 audit&lt;span class="o"&gt;(&lt;/span&gt;1772327420.130:165&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="nv"&gt;apparmor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DENIED"&lt;/span&gt; &lt;span class="nv"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt; &lt;span class="nv"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"net"&lt;/span&gt; &lt;span class="nv"&gt;info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"failed protocol match"&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-13&lt;/span&gt; &lt;span class="nv"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/usr/sbin/swanctl"&lt;/span&gt; &lt;span class="nv"&gt;pid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64058 &lt;span class="nb"&gt;comm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"swanctl"&lt;/span&gt; &lt;span class="nv"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"unix"&lt;/span&gt; &lt;span class="nv"&gt;sock_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"stream"&lt;/span&gt; &lt;span class="nv"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;requested&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt; &lt;span class="nv"&gt;denied&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt; &lt;span class="nv"&gt;addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log states that AppArmor prevented &lt;code&gt;swanctl&lt;/code&gt; from performing a "create" operation on a Unix socket. Looks like this denial might be the reason for the &lt;code&gt;Permission denied&lt;/code&gt; error, but let's confirm that this is really the issue.&lt;/p&gt;
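&lt;p&gt;To keep watching for further denials while testing, it helps to filter the kernel messages instead of scrolling the whole journal. These commands assume a systemd-based host like Proxmox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# AppArmor denials arrive as kernel audit messages
journalctl -k --grep 'apparmor="DENIED"'
# Equivalent without the journal
dmesg | grep -i 'apparmor="DENIED"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;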

&lt;p&gt;I tested the hypothesis by switching the AppArmor profile of &lt;code&gt;swanctl&lt;/code&gt; into complain mode, which logs violations instead of enforcing them, temporarily lifting any restrictions on the process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
apt &lt;span class="nb"&gt;install &lt;/span&gt;apparmor-utils &lt;span class="c"&gt;# Install packages that provide apparmor_parser&lt;/span&gt;
apparmor_parser &lt;span class="nt"&gt;-Cr&lt;/span&gt; usr.sbin.swanctl &lt;span class="c"&gt;# Disables AppArmor enforcement&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
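&lt;p&gt;If you find it more readable, the &lt;code&gt;apparmor-utils&lt;/code&gt; package also ships a wrapper that flips a profile into complain mode by path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aa-complain /etc/apparmor.d/usr.sbin.swanctl   # Same effect as apparmor_parser -Cr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;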



&lt;p&gt;Let's run &lt;code&gt;strongswan.service&lt;/code&gt; now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apparmor.d# systemctl start strongswan
root@pve:/etc/apparmor.d# systemctl status strongswan
● strongswan.service - strongSwan IPsec IKEv1/IKEv2 daemon using swanctl
     Loaded: loaded &lt;span class="o"&gt;(&lt;/span&gt;/usr/lib/systemd/system/strongswan.service&lt;span class="p"&gt;;&lt;/span&gt; enabled&lt;span class="p"&gt;;&lt;/span&gt; preset: enabled&lt;span class="o"&gt;)&lt;/span&gt;
     Active: active &lt;span class="o"&gt;(&lt;/span&gt;running&lt;span class="o"&gt;)&lt;/span&gt; since Sat 2026-02-28 15:27:15 HST&lt;span class="p"&gt;;&lt;/span&gt; 8s ago
     ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;strongswan.service&lt;/code&gt; is running fine, but what about &lt;code&gt;swanctl&lt;/code&gt;?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apparmor.d# swanctl &lt;span class="nt"&gt;--stats&lt;/span&gt;
...
&lt;span class="nb"&gt;uptime&lt;/span&gt;: 3 seconds, since Feb 28 15:33:52 2026
worker threads: 16 total, 11 idle, working: 4/0/1/0
job queues: 0/0/0/0
&lt;span class="nb"&gt;jobs &lt;/span&gt;scheduled: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yup, looks like AppArmor was the cause of this issue. The AppArmor denial is preventing &lt;code&gt;swanctl&lt;/code&gt; from creating or accessing the &lt;code&gt;/var/run/charon.vici&lt;/code&gt; socket, which also causes &lt;code&gt;strongswan.service&lt;/code&gt; to fail. By &lt;strong&gt;disabling AppArmor enforcement on &lt;code&gt;swanctl&lt;/code&gt;&lt;/strong&gt;, we can fix the &lt;code&gt;Permission denied&lt;/code&gt; issue.&lt;/p&gt;

&lt;p&gt;However, this is not the perfect solution. We can run strongSwan without AppArmor, but AppArmor provides MAC (Mandatory Access Control) rules that can prevent attackers from exploiting zero-day vulnerabilities in the strongSwan process. &lt;/p&gt;

&lt;p&gt;StrongSwan is more secure with AppArmor enabled, and if I can find a way to run AppArmor without denying IPsec, I'd take it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying Custom AppArmor Rules
&lt;/h2&gt;

&lt;p&gt;So, if AppArmor is causing the denials, what if we create a rule that allows &lt;code&gt;swanctl&lt;/code&gt; to access this socket? I found another &lt;a href="https://groups.google.com/g/linux.debian.bugs.dist/c/b6jFluenbpE" rel="noopener noreferrer"&gt;post&lt;/a&gt; with a potential solution. Essentially, the author modified &lt;code&gt;swanctl&lt;/code&gt;'s AppArmor profile to explicitly allow the process to create and access any required Unix sockets.&lt;/p&gt;

&lt;p&gt;Here's how I implemented the solution. I'm using VICI, strongSwan's newer interface, so I added the following to &lt;code&gt;/etc/apparmor.d/usr.sbin.swanctl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;unix &lt;span class="o"&gt;(&lt;/span&gt;create&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;stream &lt;span class="c"&gt;# Don't actually add this line. This is my own attempt at solving this issue.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I reloaded the profile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apparmor_parser &lt;span class="nt"&gt;-r&lt;/span&gt; /etc/apparmor.d/usr.sbin.swanctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I tried to start &lt;code&gt;strongswan.service&lt;/code&gt; with the new profile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@pve:/etc/apparmor.d# systemctl start strongswan
Job &lt;span class="k"&gt;for &lt;/span&gt;strongswan.service failed because the control process exited with error code.
...
root@pve:/etc/apparmor.d# journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; strongswan
...
Feb 28 17:38:25 pve swanctl[87986]: opening socket &lt;span class="s1"&gt;'unix:///var/run/charon.vici'&lt;/span&gt; failed: Permission denied
Feb 28 17:38:25 pve swanctl[87986]: Error: connecting to &lt;span class="s1"&gt;'default'&lt;/span&gt; URI failed: Permission denied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same error, no dice. However, this attempt did show me that the issue might not come from the profile itself, but from something else related to AppArmor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Root Cause And The Long-Term Solution
&lt;/h2&gt;

&lt;p&gt;It took a long couple of days of searching before I stumbled upon the hidden gem in this &lt;a href="https://github.com/containerd/containerd/issues/12726" rel="noopener noreferrer"&gt;containerd repository issue&lt;/a&gt;. Basically, &lt;a href="https://github.com/achernya" rel="noopener noreferrer"&gt;Alex Chernyakhovsky&lt;/a&gt; found that AppArmor recently changed its ABI, which broke the Unix socket networking options in AppArmor profiles. We have to wait for the upstream developers to fix this issue, but in the meantime we can &lt;strong&gt;configure AppArmor to use the previous ABI version&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;/etc/apparmor/parser.conf&lt;/code&gt; and add this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Force pre-kernel 6.17 ABI
override-policy-abi=/etc/apparmor.d/abi/4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can reapply &lt;code&gt;swanctl&lt;/code&gt;'s profile. In my experience, I needed to clear the cache first using the &lt;code&gt;--purge-cache&lt;/code&gt; option before loading the new profile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apparmor_parser &lt;span class="nt"&gt;--purge-cache&lt;/span&gt;
apparmor_parser &lt;span class="nt"&gt;-r&lt;/span&gt; /etc/apparmor.d/usr.sbin.swanctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's my result after applying this fix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9mp45qwjlfzccxh0f8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9mp45qwjlfzccxh0f8t.png" alt="Output of the swanctl command still saying Permission denied." width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkj8vwcnx8zg1xed8pcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkj8vwcnx8zg1xed8pcl.png" alt="Second output of the swanctl command now working and showing the status of strongSwan." width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It works, though it still printed &lt;code&gt;Permission denied&lt;/code&gt; at first. I'm not sure why this happens, but in my experience, refreshing the AppArmor service and its policies a couple of times makes it read the new settings properly. It's fiddly like that.&lt;/p&gt;

&lt;p&gt;This is only a stopgap solution until the upstream developers fix this issue, but compared to not having the AppArmor protection on the process, it's a sufficient solution for now.&lt;/p&gt;
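&lt;p&gt;To confirm that the profile is loaded and back in enforce mode after the ABI workaround, &lt;code&gt;aa-status&lt;/code&gt; is handy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Lists every loaded profile grouped by mode; /usr/sbin/swanctl should
# appear under "profiles are in enforce mode"
aa-status | grep swanctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;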

&lt;h3&gt;
  
  
  Note: Vici vs Stroke
&lt;/h3&gt;

&lt;p&gt;I ran into this &lt;code&gt;Permission denied&lt;/code&gt; issue when using strongSwan's older stroke interface as well. Forcing AppArmor to use the older ABI should work with either interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, does it work in the end?
&lt;/h2&gt;

&lt;p&gt;To verify that strongSwan is working properly and can establish security associations between my two Proxmox nodes, I deployed a simple test setup.&lt;/p&gt;

&lt;p&gt;I clustered two Proxmox nodes and created a simple VXLAN SDN zone:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8zlxk49bbyycqmiq06z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8zlxk49bbyycqmiq06z.png" alt="VXLAN zone configuration in the Proxmox nodes." width="393" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In both nodes, I set up a simple configuration for strongSwan in &lt;code&gt;/etc/swanctl/swanctl.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connections {
  vxlan {
    proposals = aes128-sha256-modp3072
    remote_addrs = 10.0.0.65 # Set to the other node's IP
    encap = yes
    local {
      auth = psk
    }
    remote {
      auth = psk
    }
    children {
      net-net {
        esp_proposals = aes128-sha256
        remote_ts = 0.0.0.0/0[udp/4789]
        local_ts = 0.0.0.0/0[udp/4789]
        mode = transport
        start_action = start
        updown = /usr/lib/ipsec/_updown iptables
      }
    }
  }
}

secrets {
  ike {
    secret = SECRET
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After applying it via &lt;code&gt;swanctl --load-all&lt;/code&gt;, here is the result when running &lt;code&gt;swanctl --list-sas&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr79e6dop24bpolvfu7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr79e6dop24bpolvfu7v.png" alt="Swanctl output listing established IPsec security associations." width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;strongSwan can now establish IPsec connections between the two nodes without AppArmor denials, and I can continue refining the IPsec configuration to secure the VXLAN communications.&lt;/p&gt;
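&lt;p&gt;A quick way to prove the VXLAN traffic is actually encrypted is to sniff the link between the nodes; with transport-mode ESP in place, you should see ESP packets rather than cleartext UDP/4789. The interface name here is a placeholder for whichever NIC carries the inter-node traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tcpdump -ni eth0 esp               # IPsec working: VXLAN frames ride inside ESP
tcpdump -ni eth0 'udp port 4789'   # IPsec broken: raw VXLAN encapsulation visible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;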

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why is this happening?
&lt;/h3&gt;

&lt;p&gt;A recent AppArmor ABI change broke some Unix socket networking options in profiles.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you fix this?
&lt;/h3&gt;

&lt;p&gt;Switch to the older ABI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;override-policy-abi&lt;span class="o"&gt;=&lt;/span&gt;/etc/apparmor.d/abi/4.0 | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/apparmor/parser.conf

systemctl restart apparmor.service &lt;span class="c"&gt;# (optional) This is to make sure AppArmor is reloaded properly&lt;/span&gt;
apparmor_parser &lt;span class="nt"&gt;--purge-cache&lt;/span&gt; &lt;span class="c"&gt;# Needed on my end because my setup won't refresh swanctl's profile&lt;/span&gt;
apparmor_parser &lt;span class="nt"&gt;-r&lt;/span&gt; /etc/apparmor.d/usr.sbin.swanctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Thoughts? Comments? Questions?
&lt;/h3&gt;

&lt;p&gt;Feel free to leave a comment!&lt;/p&gt;

</description>
      <category>linux</category>
      <category>networking</category>
    </item>
    <item>
      <title>Building a Self-hosted IAM Platform to Add SSO to My Home Lab</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Mon, 19 May 2025 08:00:51 +0000</pubDate>
      <link>https://dev.to/patimapoochai/building-a-self-hosted-iam-platform-to-add-sso-to-my-home-lab-5a2n</link>
      <guid>https://dev.to/patimapoochai/building-a-self-hosted-iam-platform-to-add-sso-to-my-home-lab-5a2n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dtw1ng87barpf3mwjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6dtw1ng87barpf3mwjb.png" alt="diagram of the project showing how the Keycloak and LLDAP maps to the IAM platform architecture" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A growing problem with my home lab is password fatigue. Each time I add a new service to my network, I generate a new random password for it. Storing a different password for every service in a password manager worked at first, but once the number of services exceeded double digits, opening my password manager every time I wanted to access a service started to hinder my productivity. I needed a self-hosted SSO solution for my home lab services, so I deployed an open-source IAM platform that supports SSO via the OIDC, OAuth, and LDAP protocols to my Kubernetes cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Walkthrough
&lt;/h2&gt;

&lt;p&gt;The initial design was simple. I only needed SSO for my services, namely Grafana and Syncthing, and my solution must support both modern protocols like OAuth/OIDC and legacy protocols like LDAP. I used &lt;a href="https://youtu.be/5uNifnVlBy4?si=C2xiKW8gnUEgNrNV" rel="noopener noreferrer"&gt;IBM's IAM architecture video&lt;/a&gt; as an inspiration to draft a simple IAM stack using a modern identity management system that integrates with a single, centralized LDAP directory store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F314h4c6lcn9zm9oua19y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F314h4c6lcn9zm9oua19y.png" alt="my version of the IBM IAM architecture diagram, showing the three layres of an IAM system" width="800" height="1045"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture of an IAM platform is made up of three layers: the base infrastructure layer, the application layer, and the connection layer. The base layer is composed of a directory store (a repository for identity information) and synchronization (the ability for multiple directories to share identity information with each other). There were many self-hosted LDAP directory servers available, like the &lt;a href="https://www.port389.org/" rel="noopener noreferrer"&gt;389 Directory Server&lt;/a&gt; and &lt;a href="https://www.freeipa.org/page/Main_Page" rel="noopener noreferrer"&gt;FreeIPA&lt;/a&gt;, but I chose &lt;a href="https://github.com/lldap/lldap" rel="noopener noreferrer"&gt;LLDAP&lt;/a&gt; as the centralized directory store because of its simple configuration and low resource usage.&lt;/p&gt;

&lt;p&gt;The application layer contains the software that implements IAM workflows like administration, access management, and roles. For this layer, I used &lt;a href="https://www.keycloak.org/" rel="noopener noreferrer"&gt;Keycloak&lt;/a&gt; to provide the functionality of an SSO system, like a user login interface and SSO redirection. The connection layer deals with identity federation across multiple IAM platforms, but because the scope of my project is only to deploy an SSO provider to my home lab network, implementing this layer is unnecessary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdnndr52ehp03ruygbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdnndr52ehp03ruygbq.png" alt="dashboard of LLDAP showing a list of users" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I started by setting up a directory store using an LLDAP server. LLDAP is a minimal LDAP server with very basic features, and it can only integrate with a limited range of LDAP services. I deployed the LLDAP server alongside a Keycloak instance using Helm and configured a read-only federation with Keycloak to synchronize the users in its directory with the users from the LLDAP directory. LLDAP also doesn't support modern authentication methods like OAuth and OIDC, so setting up federation allowed users within the LLDAP directory to authenticate with applications that don't support LDAP.&lt;/p&gt;

&lt;p&gt;Federation also allowed me to manage identities across both tools in one centralized directory store, as changes in the LLDAP directory will also be reflected in Keycloak's directory. This configuration created the least management overhead while providing the widest compatibility for both legacy applications using LDAP and modern applications using OAuth and OIDC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9zpqomzl6qzo0wj36i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o9zpqomzl6qzo0wj36i.png" alt="the keycloak configuration to synchronize it to LLDAP" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IAM platform is pretty much complete; now I only need to configure my applications to use it as the SSO provider. First, Grafana supports OAuth/OIDC authentication, so I registered Grafana as a "client" of Keycloak and configured it to redirect any user sign-in requests to Keycloak's authentication page. After the user signs in with their Keycloak account, Grafana receives ID and access tokens from Keycloak that contain the user's identity information.&lt;/p&gt;
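&lt;p&gt;The client side of this lives in Grafana's &lt;code&gt;[auth.generic_oauth]&lt;/code&gt; section. The sketch below uses a placeholder host name and realm; the keys themselves are standard Grafana options:&lt;/p&gt;

```ini
[auth.generic_oauth]
enabled = true
name = Keycloak
client_id = grafana
client_secret = ********
scopes = openid email profile
; Keycloak realm endpoints (host and realm name are placeholders)
auth_url = http://keycloak.home.internal/realms/homelab/protocol/openid-connect/auth
token_url = http://keycloak.home.internal/realms/homelab/protocol/openid-connect/token
api_url = http://keycloak.home.internal/realms/homelab/protocol/openid-connect/userinfo
```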

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyy6n8xnvoo5shritc2fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyy6n8xnvoo5shritc2fh.png" alt="the Grafana OAuth/OIDC configuration" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other applications that don't support OAuth/OIDC authentication, like Syncthing, can use LLDAP as the SSO provider directly. This setup is less secure, as the application binds with the LLDAP admin account directly to query the identity information for each login.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oknx5f8o09vpbkzo3od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oknx5f8o09vpbkzo3od.png" alt="the Syncthing LDAP configuration" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project was a great introduction to IAM architecture and its security protocols, and it will help me improve my future projects. Certain side projects are too small to use an enterprise IAM solution, yet too big to skip some form of user authentication. One project that comes to mind is the &lt;a href="https://github.com/Oliveriver/5d-diplomacy-with-multiverse-time-travel" rel="noopener noreferrer"&gt;5D Diplomacy With Multiverse Time Travel&lt;/a&gt; game. It's a web game that was initially released as a self-hosted project without user authentication, which created a huge barrier to entry for non-technical players who'd rather try it out quickly on a public instance.&lt;/p&gt;

&lt;p&gt;Projects like this could lower the barrier to entry and gain a lot of visibility by offering an IAM system, yet it's a huge pain to write your own authentication code. Learning to use off-the-shelf IAM tools like Keycloak and LLDAP saves you from reinventing the wheel, and because these tools use industry-standard protocols, you can migrate your projects to an enterprise IAM solution in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdwij8lzmx0oi3zzlvf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdwij8lzmx0oi3zzlvf5.png" alt="image of the 5d diplomacy login page" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View my project on &lt;a href="https://github.com/patimapoochai/self-hosted-iam" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What I've Learned From Troubleshooting OIDC/OAuth SSO Errors in Grafana and Keycloak</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Wed, 07 May 2025 06:38:41 +0000</pubDate>
      <link>https://dev.to/patimapoochai/a-fun-afternoon-troubleshooting-oidcoauth-sso-errors-in-grafana-and-keycloak-2mb5</link>
      <guid>https://dev.to/patimapoochai/a-fun-afternoon-troubleshooting-oidcoauth-sso-errors-in-grafana-and-keycloak-2mb5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3uxi6031xqrniblrbfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3uxi6031xqrniblrbfh.png" alt="Diagram of the process to troubleshoot OIDC/OAuth SSO error by performing diagnostics on the browser, Grafana, and Keycloak." width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm working on adding a simple IAM stack to &lt;a href="https://dev.to/patimapoochai/building-a-declarative-home-lab-using-k3s-ansible-helm-on-nixos-and-rocky-linux-21l5"&gt;my home lab&lt;/a&gt; to enable SSO for my services like Grafana. I configured the Grafana instance to allow for SSO authentication using Keycloak as the provider and OIDC/OAuth as the protocols. When I clicked the SSO option on the Grafana login page, however, I got an error after it redirected the browser to Keycloak's sign-in page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f5rliu3p1f30mz4u524.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f5rliu3p1f30mz4u524.png" alt="Keycloak login page showing an error message saying that the " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The page said that the &lt;code&gt;redirect_uri&lt;/code&gt; was invalid, and online forums described this error as being caused by the Valid Redirect URIs list in Keycloak not matching the URL of the Grafana instance. My instance is accessed at &lt;code&gt;http://nextcloud.home.internal:30007&lt;/code&gt;, and when I checked the redirect URI list in Keycloak, this URL was included. If that isn't the cause of the error, then what is?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbckd5t8eeocaxh6zb11l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbckd5t8eeocaxh6zb11l.png" alt="a picture of the URL of the Grafana OAuth configuration page with the URL of the instance shown" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r9lty3h3w2h9q7bzvg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r9lty3h3w2h9q7bzvg3.png" alt="Keycloak settings page showing the Valid Redirection URI list containing the URL of the Grafana endpoint" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe there's a clue in the browser. I checked the console output after clicking the SSO login button to see if there was something wrong with the login flow in the front end. I found that the request URL included a &lt;code&gt;redirect_uri&lt;/code&gt; query parameter containing the source of the redirection (which should be set to the URL of the Grafana instance), and it was set to &lt;code&gt;http://localhost:3000&lt;/code&gt;. That obviously doesn't match the valid redirect URIs set in Keycloak, but I didn't see anything in the Grafana OAuth settings that would make the redirection use this URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8rsxtdt9uqrd3yv92v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8rsxtdt9uqrd3yv92v5.png" alt="The Keycloak error page with the URL showing that it's being redirected from http://localhost:3000" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe I should look for a setting that is set to &lt;code&gt;http://localhost:3000&lt;/code&gt; somewhere outside of the OAuth configuration page. I browsed all the possible settings within Grafana and found something in the &lt;code&gt;General&lt;/code&gt; settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqisj93f4bqg3z4250bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqisj93f4bqg3z4250bt.png" alt="The general settings page of Grafana, with " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;domain&lt;/code&gt; and &lt;code&gt;http_port&lt;/code&gt; settings on this page have values similar to the &lt;code&gt;redirect_uri&lt;/code&gt; query parameter in the error, and both values are combined inside the template of the &lt;code&gt;root_url&lt;/code&gt; setting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jackpot&lt;/em&gt;. It seems that when Grafana redirects a user to an identity provider's SSO page, the value of &lt;code&gt;root_url&lt;/code&gt; is used as the &lt;code&gt;redirect_uri&lt;/code&gt; query parameter. Therefore, this value must match the publicly accessible URL of the Grafana instance and be included in Keycloak's valid redirect URI list. Otherwise, the user will get an &lt;code&gt;invalid parameter: redirect_uri&lt;/code&gt; error.&lt;/p&gt;
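&lt;p&gt;To make that behavior concrete, here's a minimal sketch (not Grafana's actual source; &lt;code&gt;build_authorize_url&lt;/code&gt; is a hypothetical helper) of how an OAuth client derives the &lt;code&gt;redirect_uri&lt;/code&gt; parameter from its configured root URL:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical sketch: an OAuth client builds its authorization request
# by appending the callback path to the configured root URL.
def build_authorize_url(auth_endpoint, client_id, root_url):
    params = urlencode({
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": root_url.rstrip("/") + "/login/generic_oauth",
    })
    return auth_endpoint + "?" + params

# With a default root_url, the identity provider sees localhost,
# which won't match Keycloak's valid redirect URI list.
url = build_authorize_url(
    "http://keycloak.home.internal/realms/homelab/protocol/openid-connect/auth",
    "grafana",
    "http://localhost:3000/",
)
```

&lt;p&gt;Whatever the client configures as its root URL ends up URL-encoded in the request, which is exactly the value Keycloak validates against its list.&lt;/p&gt;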

&lt;p&gt;The fix is simple. I modified Grafana's configuration file, &lt;code&gt;defaults.ini&lt;/code&gt;, changing the &lt;code&gt;domain&lt;/code&gt; and &lt;code&gt;http_port&lt;/code&gt; values so that, when they are combined into the &lt;code&gt;root_url&lt;/code&gt; value, they match the public URL of my Grafana instance at &lt;code&gt;http://nextcloud.home.internal:30007&lt;/code&gt;.&lt;/p&gt;
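&lt;p&gt;The relevant part of &lt;code&gt;defaults.ini&lt;/code&gt; ends up looking like this (the &lt;code&gt;root_url&lt;/code&gt; template shown is Grafana's default interpolation):&lt;/p&gt;

```ini
[server]
; These two values are interpolated into root_url, which in turn
; becomes the redirect_uri sent to Keycloak.
domain = nextcloud.home.internal
http_port = 30007
root_url = %(protocol)s://%(domain)s:%(http_port)s/
```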

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u0rre39qhqcifs0qcfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u0rre39qhqcifs0qcfv.png" alt="The defaults.ini file with the correct values" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Grafana pod was initially set to expose port &lt;code&gt;3000&lt;/code&gt;, and I found that I had to change it to expose port &lt;code&gt;30007&lt;/code&gt; instead for Grafana to work properly. Grafana listens inside the container on the port set in &lt;code&gt;http_port&lt;/code&gt;, so you have to adjust the port that the pod exposes accordingly.&lt;/p&gt;
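&lt;p&gt;As a sketch, the matching Kubernetes Service looks something like this (names and labels are placeholders, not my exact manifest):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  type: NodePort
  selector:
    app: grafana
  ports:
    - port: 30007        # matches http_port in defaults.ini
      targetPort: 30007  # Grafana listens on http_port inside the container
      nodePort: 30007    # within the default NodePort range (30000-32767)
```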

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp1mlmkmt816tbhmf8rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp1mlmkmt816tbhmf8rv.png" alt="The configuration of the Grafana instance's Kubernetes service which is set to expose port 30007 on the container" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a quick restart of the Grafana deployment, I tried using the OAuth SSO option when logging into Grafana again. And voila!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5mx4wiuweyivtfr3jwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5mx4wiuweyivtfr3jwb.png" alt="Keycloak SSO login page working correctly" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was able to use the Keycloak login page to authenticate with Grafana, logging in with the credentials of a user stored in the Keycloak database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe356c1yo747tkf5xr4o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe356c1yo747tkf5xr4o5.png" alt="Grafana dashboard showing that the SSO login was successful and the browser is logged into " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because this user is stored in the Keycloak database, I could then centrally manage this user in the Keycloak administrator GUI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd2wfd7qw6czd4nhtgfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd2wfd7qw6czd4nhtgfj.png" alt="Keycloak management interface showing that " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was a surprise to learn that the &lt;code&gt;root_url&lt;/code&gt; value is used as the redirection URL for Grafana, because this wasn't mentioned in the Grafana OIDC/OAuth documentation. Still, it was an interesting experience that gave me new insight into how Grafana manages OIDC/OAuth SSO redirection.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>security</category>
      <category>iam</category>
    </item>
    <item>
      <title>Building a declarative home lab using K3s, Ansible, Helm on NixOS and Rocky Linux</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Mon, 31 Mar 2025 01:32:01 +0000</pubDate>
      <link>https://dev.to/patimapoochai/building-a-declarative-home-lab-using-k3s-ansible-helm-on-nixos-and-rocky-linux-21l5</link>
      <guid>https://dev.to/patimapoochai/building-a-declarative-home-lab-using-k3s-ansible-helm-on-nixos-and-rocky-linux-21l5</guid>
      <description>&lt;p&gt;I set up a home lab to run services like Prometheus, Grafana, and more on K3s and Docker. I deployed these services using Helm and configured the operating system with Ansible. These services are running on two Beelink mini PCs using NixOS and Rocky Linux for the operating system layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvqibqim1g5f3ygw5c2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvqibqim1g5f3ygw5c2t.png" alt="high-level diagram of the setup showing the network map, the two hosts and their OS, the boundary where k3s and docker services lie, and the services namely Prometheus and Grafana" width="547" height="842"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26fc79m2n6dmoklxigop.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26fc79m2n6dmoklxigop.jpg" alt="picture of the front of the ZimaBoard, switch, and the two Beelinks" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Features
&lt;/h2&gt;

&lt;p&gt;My home lab provides the following functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability with system health and performance monitoring from &lt;strong&gt;Prometheus&lt;/strong&gt; and dashboard visualization from &lt;strong&gt;Grafana&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Configuration management and automation using &lt;strong&gt;Ansible&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Containerized services orchestrated with the &lt;strong&gt;K3s&lt;/strong&gt; flavor of Kubernetes&lt;/li&gt;
&lt;li&gt;Declarative container deployment using &lt;strong&gt;Helm&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Declarative operating system configuration with &lt;strong&gt;NixOS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Version control system hosting local code repositories on &lt;strong&gt;Forgejo&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;File storage server with &lt;strong&gt;Nextcloud&lt;/strong&gt; and file synchronization with &lt;strong&gt;Syncthing&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Self-hosted application launcher using &lt;strong&gt;Heimdall&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pain of the Manual
&lt;/h2&gt;

&lt;p&gt;I love working on infrastructure deployment and operations. Before this project, I used to tinker with smaller labs using Linux servers like Proxmox and Ubuntu. I also love learning about new open-source, self-hosted services that could boost my productivity and expand my skill set.&lt;/p&gt;

&lt;p&gt;This time, however, I had one new objective: to have a home lab that is fun to manage. My past labs caused a lot of pain because my devices were configured manually. When something (inevitably) goes wrong, I often don't have the time to dig through the logs, trace each step, and figure out what happened. I'd rather reset everything to get the service back online, prioritizing availability.&lt;/p&gt;

&lt;p&gt;However, this preference came with a big cost because I'd have to configure everything by hand again. It was discouraging to retrace my steps and relearn how to apply the same settings over and over again. It also limited the scope of my home lab because the risk of having to reapply the same configurations increases the more I add new devices and experiment with new services.&lt;/p&gt;

&lt;p&gt;So I came up with a plan: I'm going to &lt;strong&gt;use as much X-as-code technology as I can&lt;/strong&gt; for this project. I wanted to use Kubernetes. I wanted to codify as much hands-on Linux configuration into Ansible as possible. I even went as far as planning to use NixOS, an experimental Linux distribution where &lt;em&gt;every&lt;/em&gt; system configuration is in code. This goal became a major influence on how I chose the technology I used for this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Observability tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment and configuration management tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ansible.com/ansible/latest/index.html" rel="noopener noreferrer"&gt;Ansible&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Containerization and orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes with &lt;a href="https://k3s.io/" rel="noopener noreferrer"&gt;K3s&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operating system layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nixos.org/" rel="noopener noreferrer"&gt;NixOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockylinux.org/" rel="noopener noreferrer"&gt;Rocky Linux&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhwb0jyp7c9xy98yqgfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhwb0jyp7c9xy98yqgfs.png" alt="diagram of each layer of the core features and the tools that are used for each layer" width="480" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I chose these technologies
&lt;/h3&gt;

&lt;p&gt;This project got started when I found the &lt;a href="https://www.amazon.com/Beelink-Lake-N150-Upgraded-Computer-Business/dp/B0DP2SGVVY?crid=36HCIJKITJN2B&amp;amp;dib=eyJ2IjoiMSJ9.aVqwMZh9Fvu-6xw_P6_JgovLN5sln2oOcTSUjwQQiUTC_qiDysVZ3qucHMVfcgIp8Gqw6x697gJihPdfsCQvF2SwYgO2lFPD3oEzWczH2dkvy-vGkEjSyAvevFtczarBjSq05tzfvaCdIMcrV6oJJewATWlaPuPg4ptQxul5Eg60TnrntatfIKSFT9e1uMHLuIre8oIzYF9gwUG9YvZN9g.oC5r2_G7cWkC2fWlxWQ5ZL9gW1JFlEquuRhtDNZXPds&amp;amp;dib_tag=se&amp;amp;keywords=beelink+mini+pc&amp;amp;qid=1743207996&amp;amp;sprefix=beelink+mini+p%2Caps%2C422&amp;amp;sr=8-2" rel="noopener noreferrer"&gt;Beelink mini PCs&lt;/a&gt; (not sponsored) while browsing Amazon. I've seen some people recommending these mini PCs due to their low power consumption, decent performance, and relatively cheap price ($139 at the time of writing). They would be the perfect bare-metal layer for my lab environment as I can buy and set up as many of them as I want without breaking the bank.&lt;/p&gt;

&lt;p&gt;I chose Rocky Linux as the OS for my first machine, as it's a free rebuild of Red Hat's RHEL, and I'm most familiar with RHEL-based distributions from the time spent studying for the RHCSA certification. The second machine runs NixOS as a way for me to get started with the declarative approach to Linux server configuration. NixOS is one of the most declarative ways to run your infrastructure, so it's the perfect tool for turning as much of my lab as possible into code.&lt;/p&gt;

&lt;p&gt;The decision to use mini PCs as the host machines heavily affected the tools I chose for the container runtime and orchestration layer. As each machine has decent but limited performance, K3s was the only practical choice for running a Kubernetes cluster because of its low resource requirements. I didn't consider using Helm with K3s at first, but I started using it midway through the project because it was getting too difficult to manage the Kubernetes manifests for my services.&lt;/p&gt;

&lt;p&gt;However, I still wanted a few services to run on Docker using Docker Compose. These services are essential for my day-to-day productivity, and I wanted them to be accessible even if K3s is down.&lt;/p&gt;

&lt;p&gt;For transcribing manual Linux configuration into code, I turned to Ansible because it's compatible with both Rocky Linux and NixOS while having an easy-to-learn syntax. It also doesn't require an agent to be installed on each host, so I can provision new machines without setting up an OS imaging pipeline.&lt;/p&gt;

&lt;p&gt;By the end of the project, I got interested in adding a monitoring stack to my lab from listening to The Pragmatic Engineer's podcast episode on &lt;a href="https://newsletter.pragmaticengineer.com/p/observability-the-present-and-future" rel="noopener noreferrer"&gt;Observability with Charity Majors&lt;/a&gt;. I wanted to take the simplest path to provide observability, so I chose the Prometheus + Grafana stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wi-Fi bridge networking
&lt;/h3&gt;

&lt;p&gt;First, I started with the networking. I had an issue where there were no direct ethernet ports that could connect my home lab to the internet, so I set up a Wi-Fi extender as a bridge and used its ethernet output to create a nested network within my home. The traffic between the devices in the lab will be routed internally, while the internet-bound traffic will be forwarded over the Wi-Fi extender. View more details about this process in &lt;a href="https://dev.to/patimapoochai/how-to-run-a-home-lab-without-an-ethernet-port-220j"&gt;my blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1m4b4ha9feht8k5qfnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1m4b4ha9feht8k5qfnm.png" alt="image the network map with two networks with one of them nested inside the other" width="712" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration management and automation
&lt;/h3&gt;

&lt;p&gt;I then installed Rocky Linux on my first machine and forced myself to use only Ansible to manage its configuration.&lt;/p&gt;

&lt;p&gt;I wrote an Ansible playbook to set up the host to be managed by Ansible. Some of the tasks that I wrote into the playbook are: creating an SSH key pair, uploading the key to the host, and setting up privilege escalation for the Ansible user account. I also wrote a playbook to optimize the system's power consumption by using the &lt;code&gt;tuned&lt;/code&gt; package to apply battery-saving settings to the system.&lt;/p&gt;
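&lt;p&gt;A condensed sketch of that bootstrap playbook is below. The host group, account name, and key path are placeholders for my setup, but the modules are standard Ansible:&lt;/p&gt;

```yaml
- name: Bootstrap a host for Ansible management
  hosts: rocky
  become: true
  tasks:
    - name: Create the Ansible service account
      ansible.builtin.user:
        name: ansible
        state: present

    - name: Upload the controller's public SSH key
      ansible.posix.authorized_key:
        user: ansible
        key: "{{ lookup('file', '~/.ssh/id_ed25519.pub') }}"

    - name: Allow passwordless privilege escalation
      ansible.builtin.copy:
        dest: /etc/sudoers.d/ansible
        content: "ansible ALL=(ALL) NOPASSWD: ALL\n"
        mode: "0440"
        validate: visudo -cf %s

    - name: Apply the power-saving tuned profile
      ansible.builtin.command: tuned-adm profile powersave
```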

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx7pbbstoot4747haeeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx7pbbstoot4747haeeo.png" alt="snippet of the ansible code to apply power-saving settings to the host machine" width="723" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It took quite a while to write the playbooks to get the machine ready for running container services. However, it was worth the effort because if the machine breaks and needs a reset, I can just rerun the Ansible playbook again, and it will be up and running in no time. I don't have to do anything by hand as all of the configuration to prepare the system for running services is defined in the configuration management tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Productivity and essential services
&lt;/h3&gt;

&lt;p&gt;There were a few services that were essential for my productivity, so I used Docker Compose to deploy them. I wanted services like Forgejo, Nextcloud, and Syncthing to be available at all times, even if K3s is unavailable. I have the least tolerance for downtime with Forgejo in particular, as it will host the Kubernetes manifests and Ansible playbooks required to deploy everything else in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl7n9632bhg8ham82qqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl7n9632bhg8ham82qqs.png" alt="image of a Forgejo future ansible and helm repository" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wrote Ansible playbooks to automate the deployment of the Docker Compose services to the Rocky Linux host. Some configurations can't be set within the Docker Compose files, like adding firewall rules to the host to open the necessary ports for each service, so those settings are applied by the playbook. This extra automation layer makes the whole process declarative and written as code.&lt;/p&gt;
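&lt;p&gt;Each of those playbooks follows roughly this shape: open the service's ports with &lt;code&gt;firewalld&lt;/code&gt;, then bring the Compose project up. The port and project path below are illustrative placeholders:&lt;/p&gt;

```yaml
- name: Deploy an essential Docker Compose service
  hosts: rocky
  become: true
  tasks:
    - name: Open the service's web port on the host firewall
      ansible.posix.firewalld:
        port: 3000/tcp
        permanent: true
        immediate: true
        state: enabled

    - name: Start the Compose project
      community.docker.docker_compose_v2:
        project_src: /opt/forgejo
```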

&lt;h3&gt;
  
  
  Container orchestration
&lt;/h3&gt;

&lt;p&gt;Ansible was also used to deploy K3s on the Rocky Linux host as the container orchestration layer. While the installation process for K3s is very simple (just running a shell script), the software requires a few special firewall rules to classify the traffic originating from the container network as "trusted." Adding these rules only takes a few commands, but I still worked them into a playbook. The time it takes to run a few commands manually adds up fast when you have to set up multiple nodes in the future.&lt;/p&gt;
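&lt;p&gt;For RHEL-based hosts, the K3s documentation suggests trusting the default pod and service CIDRs in &lt;code&gt;firewalld&lt;/code&gt;; as an Ansible task (host group is a placeholder), that rule looks roughly like this:&lt;/p&gt;

```yaml
- name: Trust K3s container traffic
  hosts: rocky
  become: true
  tasks:
    - name: Mark the pod and service CIDRs as trusted
      ansible.posix.firewalld:
        zone: trusted
        source: "{{ item }}"
        permanent: true
        immediate: true
        state: enabled
      loop:
        - 10.42.0.0/16  # default K3s pod CIDR
        - 10.43.0.0/16  # default K3s service CIDR
```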

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpflt72kynnd4chaen32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpflt72kynnd4chaen32.png" alt="image of the configuration to allow traffic coming from the container CIDR" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative containers
&lt;/h3&gt;

&lt;p&gt;I didn't plan to use Helm initially. Helm wasn't included in the curriculum for CKA when I took it, so I assumed that it wasn't needed if you're doing SysOps work (like setting up a home lab infrastructure). I also thought that it would be difficult to learn and that the time saved from using this tool would be too small. However, it was getting more difficult to manage the ever-growing number of Kubernetes manifests, so I decided to pick up Helm. It turned out to be a simple tool to learn, and it made the deployment process faster by bundling all of the manifests into one package.&lt;/p&gt;

&lt;p&gt;The first service I deployed with Helm was Heimdall, which provides a central place to host the shortcuts to all of the self-hosted services in my lab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlc2wbs90ezmprqobo1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlc2wbs90ezmprqobo1h.png" alt="image of a successful Helm deployment with the shortcuts to each service" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics, monitoring, and visualization
&lt;/h3&gt;

&lt;p&gt;After picking up Helm, I wrote the service manifests for the Prometheus and Grafana services as Helm charts and used Ansible to automate their deployment to my K3s cluster. I also wrote a playbook to deploy the Prometheus node exporter binary, upload the configuration files, and open the firewall ports for metrics collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymt8gzui12ekowkm10l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymt8gzui12ekowkm10l2.png" alt="image of a running Prometheus server" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the data collected from the node exporter, I could visualize each system metric in a dashboard by connecting Grafana to the Prometheus server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypw9q2jpwq8kgqx8xhxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypw9q2jpwq8kgqx8xhxq.png" alt="image of the Grafana dashboard" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting Prometheus
&lt;/h3&gt;

&lt;p&gt;While deploying Prometheus, I ran into an &lt;code&gt;open /prometheus/queries.active: permission denied&lt;/code&gt; issue. After a few days of troubleshooting, I found that Prometheus requires specific permissions on the host directory when it's bind-mounted into the container. I fixed this by adding a task to the Prometheus deployment playbook that changes the owner of the host directory to the same user as the one in the container. You can read how I troubleshot and fixed this issue in &lt;a href="https://dev.to/patimapoochai/how-to-fix-prometheus-open-prometheusqueriesactive-permission-denied-on-kubernetes-22gf"&gt;my blog post&lt;/a&gt;.&lt;/p&gt;
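
&lt;p&gt;A sketch of what that ownership task can look like is below. The host path is hypothetical, and UID 65534 (the &lt;code&gt;nobody&lt;/code&gt; user the official Prometheus image runs as) should be verified against the image you actually deploy.&lt;/p&gt;

```yaml
# Sketch: make the bind-mounted data directory writable by the container user.
# /srv/prometheus is a hypothetical host path; 65534 is the "nobody" UID used
# by the official Prometheus image -- verify it against your image.
- name: Set Prometheus data directory ownership
  ansible.builtin.file:
    path: /srv/prometheus
    state: directory
    owner: "65534"
    group: "65534"
    recurse: true
  become: true
```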

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wvm2fydsdh1g971u1go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wvm2fydsdh1g971u1go.png" alt="image of the bind mount issue, showing the permission denied error" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also applied this lesson when writing the Helm chart for Grafana by setting the host directory's owner to have the same UID as the one defined in Grafana's &lt;a href="https://github.com/grafana/grafana/blob/28b142e9513f587c4be62801794a8609037adbe8/Dockerfile#L138" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxo3df53qxtyfdc0x53w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxo3df53qxtyfdc0x53w.png" alt="image of Grafana's docker files" width="272" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative OS configuration
&lt;/h3&gt;

&lt;p&gt;While I had most of the services that I wanted running on the Rocky Linux host, I also wanted to learn declarative Linux configuration, so I provisioned another Beelink mini PC with NixOS. With this OS, you don't need to install extra configuration management tools as the operating system is already declarative, allowing you to configure every system setting from a single code file.&lt;/p&gt;

&lt;p&gt;I wrote the NixOS configuration files to do the following: configure the host to be managed by Ansible, install K3s as an agent with dynamic IP address resolution for the K3s server node, install the Prometheus exporter, and expose the exporter's metrics endpoint to other nodes. I then wrote an Ansible playbook to push these files to the NixOS host, test the configuration, and rebuild the operating system from the configuration.&lt;/p&gt;
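
&lt;p&gt;A condensed sketch of what such a configuration can look like, using the standard NixOS modules, is below. The server address, token path, and SSH key are placeholders.&lt;/p&gt;

```nix
# Condensed sketch of the NixOS configuration described above.
# The server address, token path, and SSH key are placeholders.
{ config, pkgs, ... }:
{
  # Let Ansible manage the host over SSH with passwordless sudo
  services.openssh.enable = true;
  users.users.ansible = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
    openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAA... ansible@controller" ];
  };
  security.sudo.wheelNeedsPassword = false;

  # Join the K3s cluster as an agent
  services.k3s = {
    enable = true;
    role = "agent";
    serverAddr = "https://k3s-server.lab:6443";
    tokenFile = "/etc/k3s/token";
  };

  # Export node metrics and open the firewall for scraping
  services.prometheus.exporters.node = {
    enable = true;
    port = 9100;
    openFirewall = true;
  };
}
```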

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlypmickjyyev2u13cy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlypmickjyyev2u13cy3.png" alt="image of Ansible playbook for setting up NixOS showing the dynamic IP resolution" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using NixOS was a huge time saver. I estimate that setting up Ansible management, K3s, and the Prometheus exporter on the Rocky Linux host took about a week of evenings and weekends. With NixOS, it took about 3 hours to set up the same three configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftna9zng4ldrjyy9h5lhv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftna9zng4ldrjyy9h5lhv.png" alt="snippet of the nixos code to install and expose Prometheus node exporter" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While learning how to set up the host for Ansible management, I had a hard time understanding how the NixOS sudoers file worked, as the documentation was sparse. After taking a few days to learn how sudo is implemented in NixOS, I compiled my notes and published them as a guide on dev.to. View my guide &lt;a href="https://dev.to/patimapoochai/how-to-edit-the-sudoers-file-in-nixos-with-examples-4k34"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reflecting on the tool choices
&lt;/h3&gt;

&lt;p&gt;The Prometheus + Grafana stack is a good introduction to observability, and I want to dive deeper into this topic by replacing Prometheus with &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; sometime in the future.&lt;/p&gt;

&lt;p&gt;Choosing K3s and Helm (later on) was a great tooling choice. I love how K3s makes the process of getting a Kubernetes cluster up and running very easy. Helm has also become an essential tool for deploying services on Kubernetes, and I'm hoping to learn how to deploy services from a Helm repository in the future.&lt;/p&gt;

&lt;p&gt;Ansible was also a great tool for deploying and managing my servers, but I wish it were better. The tool is a step above writing bash scripts, providing idempotency and a large library of ready-to-use modules. However, I can't easily undo changes after they're applied. If I make a mistake in my code, I can't just delete the offending line and rerun the playbook the way you would fix the same mistake in a modern DevOps tool like &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These issues, however, are non-existent on NixOS. The configuration files behaved like a modern cloud infrastructure tool, where rollback is as simple as removing a line and rebuilding the operating system. Ansible can be added to existing systems to manage them as code, but in NixOS, the system is &lt;em&gt;already&lt;/em&gt; code. I'm looking forward to learning &lt;a href="https://nixos.wiki/wiki/flakes" rel="noopener noreferrer"&gt;Nix flakes&lt;/a&gt; in the future to make my configuration even more reproducible across any device.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, still going back to manual configuration?
&lt;/h2&gt;

&lt;p&gt;Going forward, I'm going to double down on always turning manual configurations into code. It just makes everything about building a home lab painless and fun. Have a broken service? You can just re-run the Helm or Docker Compose deployment. The host machine is broken with no obvious fix? Just wipe it and run the Ansible playbooks to rebuild the machine automatically, without spending hours setting it up manually again.&lt;/p&gt;

&lt;p&gt;As long as I have the code to rebuild my lab, I won't feel anxious about my setup even if you decide to toss my devices into the ocean. (I would be confused as to why you would do that, but not anxious.)&lt;/p&gt;

&lt;p&gt;Check out the infrastructure code for this project on &lt;a href="https://github.com/patimapoochai/declarative-homelab-project/tree/main" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>linux</category>
      <category>docker</category>
    </item>
    <item>
      <title>How I built a home lab without an ethernet port</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Thu, 27 Mar 2025 05:41:55 +0000</pubDate>
      <link>https://dev.to/patimapoochai/how-to-run-a-home-lab-without-an-ethernet-port-220j</link>
      <guid>https://dev.to/patimapoochai/how-to-run-a-home-lab-without-an-ethernet-port-220j</guid>
<description>&lt;p&gt;In this blog, I will show you how I set up my home lab network without a direct ethernet port (also known as a &lt;a href="https://networkencyclopedia.com/drop/" rel="noopener noreferrer"&gt;network drop&lt;/a&gt;). I used a WiFi range extender as a bridge to the household's SOHO (small office/home office) router, converting the WiFi signal into a wired ethernet output. I then connected the WiFi extender to an OPNsense router, which provides internet access to devices without WiFi capabilities while routing traffic within the network over a dedicated, wired connection. I also connected another SOHO router to the OPNsense router as an AP (access point) to bring wireless devices onto the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxuqg95l6d3zuwvh4ijm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxuqg95l6d3zuwvh4ijm.png" alt="Diagram of the network showing all of the connections between the SOHO router, the WiFi extender, the OPNsense router, the Zimaboard, the Switch, and the AP" width="657" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What scenarios would be best suited for this setup?
&lt;/h2&gt;

&lt;p&gt;This setup is great for scenarios where you don't have a direct ethernet connection between your router and your home lab, and you can't install new network drops to connect them. In my case, my house has a SOHO gateway router that is installed several rooms away from my home lab, and I can't make modifications to my living space, like drilling into the ceiling or through a wall, to run an ethernet cable to my setup. Other cases that would benefit from this setup are when you're renting, dorming, or living in a place that restricts modifications to the plenum space or the structure of the space to install new ethernet ports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecw0kv4x19v9i265r78y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecw0kv4x19v9i265r78y.png" alt="Diagram of the floorplan of my home, showing an icon of the router in the living room, and the area of my room that contains my home lab equipment" width="741" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the demarcation point for my house is located near the front of the building, the SOHO router that provides the WiFi connection for my household is very far from where my devices are located, and the only way I could get a wired connection between them is by running the cable through the ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I chose to set up another router
&lt;/h2&gt;

&lt;p&gt;While I can still run my home lab devices off of the SOHO router, there are a few reasons why I wanted a better setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Some home lab devices don't have a good WiFi antenna&lt;/strong&gt; - Some of my devices don't have a powerful enough WiFi card, and as my router is very far away, they can't get a stable network connection. Some home lab equipment that I plan to install in the future doesn't even have WiFi capabilities (like the &lt;a href="https://libre.computer/products/aml-s905x-cc/" rel="noopener noreferrer"&gt;AML-S905X-CC&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network-heavy tasks will flood the WiFi network&lt;/strong&gt; - Some of my services generate a large amount of network traffic, and it can create congestion on the WiFi network and slow down the internet connection. Since the SOHO router is used by everyone in my family, I can't have my network-heavy tasks running on it. The limited bandwidth of WiFi can also create a bottleneck for high-bandwidth network tasks like NAS (Network Attached Storage) servers and media streaming services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted to set up a dedicated WiFi extender that consolidates many weak WiFi signals into a single, high-performance connection to my router. I also wanted to keep network-heavy traffic off of the WiFi network and route them via a wired connection instead. By creating a setup to convert the WiFi signals into ethernet, I can get the benefit of having a wired connection for my home lab without spending money and time to set up a direct physical connection to my router.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun176qzz8r3fa0vp0dby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun176qzz8r3fa0vp0dby.png" alt="Diagram visualizing how many devices can flood the WiFi network with signals, while a single WiFi extender can fully use the WiFi channel without disruption" width="710" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want a stable, wired connection for your home lab devices, you can follow along with the steps I took to implement this setup below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up the physical devices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gear list
&lt;/h3&gt;

&lt;p&gt;These are the devices I used for this setup:&lt;/p&gt;

&lt;h4&gt;
  
  
  WiFi range extender: &lt;strong&gt;TP-Link re220&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t2lgshm9f274fufdskm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t2lgshm9f274fufdskm.jpg" alt="Image of a TP-link re220 WiFi range extender" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Router computer: &lt;strong&gt;Zimaboard&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhijoo3r0ao293ibog7qt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhijoo3r0ao293ibog7qt.jpg" alt="Image of a Zimaboard" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Switch: &lt;strong&gt;TP-Link TL-SG108E managed switch&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhyp2kl6ctv3gu4pz6tf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhyp2kl6ctv3gu4pz6tf.jpg" alt="Image of a TL-SG108E managed switch" width="800" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You don't have to use these specific devices. You can use whatever best fits your use case, as long as it has the necessary features. For the WiFi extender, the key feature is an ethernet port output. For the router computer, you're looking for a device that is compatible with your desired router OS and has two ethernet ports. The switch doesn't even have to be managed; I went with the TL-SG108E because I want to configure VLANs and other network settings in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up the WiFi extender
&lt;/h3&gt;

&lt;p&gt;The first step is to set up the WiFi extender to connect it to the SOHO router over WiFi. The individual steps to do so can vary depending on the device you use, so it's best that you follow your device's instructions or manual.&lt;/p&gt;

&lt;p&gt;In my case, I connected my laptop to the WiFi extender via its ethernet port to access the device's web interface. Then, I entered the SSID of my home network and its WiFi password, which the extender uses to connect to the SOHO router's WiFi network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tf7882l67d8fysev2ki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tf7882l67d8fysev2ki.png" alt="Screenshot of the configuration UI of the WiFi extender showing that it's connected to the SOHO router's WiFi network" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll know the WiFi extender is connected when a computer plugged into the extender's ethernet port can reach the internet. I just ran a quick &lt;code&gt;ping&lt;/code&gt; command and got replies back.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2bu0c2ksger4izsor3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2bu0c2ksger4izsor3v.png" alt="Terminal snippet of me running the ping command to google.com and getting replies back" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up your WiFi extender, here is how your network should look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5htj96x4mxjzdsoklnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5htj96x4mxjzdsoklnc.png" alt="Diagram of the network showing the SOHO router and its CIDR, the WiFi extender being on the network, and the WiFi extender having its own IP within the home router's CIDR" width="336" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up the router and the switch
&lt;/h3&gt;

&lt;p&gt;Here is how I set up the other devices:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmou0063vfzjbqkgxkweg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmou0063vfzjbqkgxkweg.png" alt="Diagram of the physical setup of the network. The WiFi extender's ethernet output is plugged into the first interface of the Zimaboard. Then, the second interface of the Zimaboard is plugged into a port on the switch. Other devices for the home lab are plugged into the remaining ports on the switch" width="558" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a few key points to this configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ethernet cable that is coming from the WiFi extender should be plugged into the WAN port of your router. In my case, the OPNsense router can use any port as the WAN interface, so I plugged it into the &lt;code&gt;re0&lt;/code&gt; interface.&lt;/li&gt;
&lt;li&gt;Then, you can connect the other interface (&lt;code&gt;re1&lt;/code&gt;) on your router to the switch. The router will provide internet access and LAN routing via the switch, so you can connect any home lab devices you have to the remaining ports of the switch.&lt;/li&gt;
&lt;li&gt;Lastly, you should place your WiFi extender as close to your home router as possible and point it in the SOHO router's direction. This will ensure that you get the best possible connection between your home router and your home lab network. For my setup, I mounted my WiFi extender on a gooseneck holder, and I pointed it in the direction of the SOHO router.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1owhy4400juurf2cjzi9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1owhy4400juurf2cjzi9.jpg" alt="Image of the Zimaboard connected to the switch and the WiFi extender" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uieuv4f4f6yar6j11uw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uieuv4f4f6yar6j11uw.jpg" alt="Image of my WiFi Extender mounted onto a gooseneck holder" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure the networking layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The general steps
&lt;/h3&gt;

&lt;p&gt;I went with OPNsense as the router operating system for this project, but if you don't want to use OPNsense, you can still follow along. The workflow might differ, but the general steps for configuring the router are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set the interface connected to the WiFi extender as the WAN interface&lt;/li&gt;
&lt;li&gt;Set the interface connected to the switch as the LAN interface&lt;/li&gt;
&lt;li&gt;Make sure a DHCP server is active on the LAN interface&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Configure OPNsense
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Install and start OPNsense according to the &lt;a href="https://docs.opnsense.org/manual/install.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;In short, you would flash the installation image onto a USB drive, boot from the USB drive, log into the installation account, and follow the installation wizard.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Take note of which interface is connected to the WiFi extender and which is connected to the switch
&lt;/h4&gt;

&lt;p&gt;If you recall the physical devices I set up earlier, I connected the WiFi extender to &lt;code&gt;re0&lt;/code&gt; and the switch to &lt;code&gt;re1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1403at4gvgl5mi6375nx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1403at4gvgl5mi6375nx.jpg" alt="OPNsense CLI showing the available interfaces" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Assign the WAN interface to the port connected to the WiFi extender (&lt;code&gt;re0&lt;/code&gt;) and set the IP address of the WAN interface to DHCP
&lt;/h4&gt;

&lt;p&gt;This will make the OPNsense router take whatever IP the SOHO router assigns to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwrn1h6cyoebbmarvebt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwrn1h6cyoebbmarvebt.jpg" alt="OPNsense CLI showing the command option to set WAN to DHCP" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Assign the LAN interface to the port connected to your switch and use a static IP address for the LAN interface
&lt;/h4&gt;

&lt;p&gt;In OPNsense, setting the static IP address of the LAN interface also determines the network address range for your network. That means you should set the LAN interface's IP to the first non-network address of your desired CIDR (Classless Inter-Domain Routing) range.&lt;/p&gt;

&lt;p&gt;For example, I want my home lab network to have a CIDR range of 10.0.0.0/24, so I would set the static IP address of the LAN interface in OPNsense to 10.0.0.1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wii9cdhgb2nm4t17i90.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wii9cdhgb2nm4t17i90.jpg" alt="OPNsense CLI showing the command to assign 10.0.0.1 to the LAN interface" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should also make sure that the IP range on the LAN interface doesn't overlap with the IP range of your SOHO network, as setting them to overlap will cause routing issues on your home lab network. For example, if my SOHO router uses 192.168.1.0/24, I cannot set my home lab router to use 192.168.1.0/24 as well. However, 192.168.2.0/24 will not overlap with the SOHO router.&lt;/p&gt;
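
&lt;p&gt;If you want to double-check a pair of ranges, the overlap test is easy to script with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module:&lt;/p&gt;

```python
from ipaddress import ip_network

soho = ip_network("192.168.1.0/24")   # the SOHO router's network

# A lab network that reuses the SOHO range conflicts...
print(soho.overlaps(ip_network("192.168.1.0/24")))  # True
# ...while a different range does not.
print(soho.overlaps(ip_network("192.168.2.0/24")))  # False
print(soho.overlaps(ip_network("10.0.0.0/24")))     # False
```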

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2txdd0fnr0zisfcpkv9k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2txdd0fnr0zisfcpkv9k.jpg" alt="OPNsense CLI showing the IP for the interface of WAN, and how it's different than the LAN interface" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Set up a DHCP server that will run on the LAN interface.
&lt;/h4&gt;

&lt;p&gt;In OPNsense, you set the lower and upper bounds of the IP assignment range. Since my network address is 10.0.0.0/24, the valid client address range would be between 10.0.0.2 and 10.0.0.254 (10.0.0.1 is taken by the LAN interface, and 10.0.0.255 is the broadcast address).&lt;/p&gt;
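
&lt;p&gt;You can verify those bounds with Python's &lt;code&gt;ipaddress&lt;/code&gt; module, which enumerates every usable host address in a network:&lt;/p&gt;

```python
from ipaddress import ip_network

lab = ip_network("10.0.0.0/24")
# hosts() excludes the network address (.0) and the broadcast address (.255)
hosts = list(lab.hosts())

print(hosts[0])    # 10.0.0.1  -> reserved for the LAN interface
print(hosts[1])    # 10.0.0.2  -> lower bound of the DHCP range
print(hosts[-1])   # 10.0.0.254 -> upper bound of the DHCP range
```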

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh3vssswbl6y1a6yvwqf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh3vssswbl6y1a6yvwqf.jpg" alt="OPNsense CLI showing the DHCP server setting the range of client IP addresses" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After assigning interfaces and setting up the DHCP server, the WAN interface of the OPNsense router will be assigned an IP address within the SOHO router's WiFi network, making the router appear as just another device on the WiFi. However, the OPNsense router will also serve as the internet gateway for the home lab devices wired to the switch. Traffic within the 10.0.0.0/24 network only passes through the switch, while internet-bound traffic is forwarded to the OPNsense router, which in turn forwards it to your SOHO router over the WiFi connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecprthaioep4v8sa2i5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecprthaioep4v8sa2i5l.png" alt="diagram of the two networks, the SOHO and the home lab network, connected by the home lab router and WiFi extender in the middle. An arrow depicts a network packet as it is sent to the home lab router, over WiFi, to the home router, and lastly forwarded to the internet. Each network has its own network range, and the home lab network routes network heavy traffic through the switch without sending out traffic over WiFi" width="711" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up an access point with another SOHO router
&lt;/h2&gt;

&lt;p&gt;We have set up the home lab network to forward wired traffic to the SOHO router over WiFi, but wireless devices cannot connect to the home lab network as is, since the OPNsense router doesn't have a WiFi card. We can fix this by adding an AP device plugged into the switch. It will receive traffic from your wireless devices over WiFi and forward it to the OPNsense router, serving as the bridge between your wireless devices and your wired home lab devices.&lt;/p&gt;

&lt;p&gt;Like before, you can use any AP device as long as it has an ethernet port. In my case, I reused an old ASUS RT-AC68U SOHO router. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf4i3093fri7sfojhcl5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf4i3093fri7sfojhcl5.jpg" alt="picture of the Asus router" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I simply connected an ethernet cable from the switch to the WAN port of the ASUS router. The router detected that it was on a network already managed by another router and automatically switched to AP mode. In this mode, the ASUS router acts as an AP, meaning it only forwards traffic between wireless devices and your home lab router without creating its own network. If your device doesn't automatically switch into AP mode, you can set it manually through the router's management UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucwyhiab4h5t3hxg5hi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucwyhiab4h5t3hxg5hi5.png" alt="picture of the Asus router configuration interface showing it's in AP mode" width="768" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The network should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafvlft2hqm1sl6tw5koa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafvlft2hqm1sl6tw5koa.png" alt="diagram of the two network with the addition of the AP" width="712" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  WiFi channels overlapping issue
&lt;/h3&gt;

&lt;p&gt;The close proximity between the WiFi extender and your AP can cause an issue: the WiFi signals emitted by the two devices can overlap, causing signal collisions and slowing down the connection speed in your network.&lt;/p&gt;

&lt;p&gt;I fixed this issue by setting the AP to use different WiFi channels than the WiFi extender. WiFi devices receive and transmit signals within certain ranges of channels, and you can prevent signal collisions by setting each device to use non-overlapping channels.&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://f-droid.org/packages/com.vrem.wifianalyzer/" rel="noopener noreferrer"&gt;WiFiAnalyzer from F-Droid&lt;/a&gt;, I saw that the WiFi extender was using channels 34-50 on the 5 GHz band and channels 1-3 on the 2.4 GHz band. If my AP used the same range of channels, it would have to compete with the signals from the WiFi extender and slow down the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdmft04gsgfdt0x34r0g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdmft04gsgfdt0x34r0g.jpg" alt="picture of the WiFi analyzer showing that the WiFi extender is running in a specific range on 5GHZ frequency" width="800" height="1650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuk1e7480fq3unfnbptr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuk1e7480fq3unfnbptr.jpg" alt="picture of the WiFi analyzer showing that the WiFi extender is running in a specific range on 2.4GHZ frequency" width="800" height="1642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From WiFiAnalyzer, channels 163-167 on 5 GHz and channels 9-11 on 2.4 GHz appeared to be the least saturated, so I set the AP to use those ranges on each band. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj6h2b0u9jvgqc3hd94d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj6h2b0u9jvgqc3hd94d.png" alt="picture of the Asus router configuration interface showing that the interfaces are set to certain channels on 5GHZ frequency" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34nxuwr020tp4ysecw4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34nxuwr020tp4ysecw4y.png" alt="picture of the Asus router configuration interface showing that the interfaces are set to certain channels on 2.4GHZ frequency" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this change, both devices should be using different channels, allowing them to work in harmony despite the close proximity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84fmqh2parvtt47pp1ki.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84fmqh2parvtt47pp1ki.jpg" alt="picture of WiFi analyzer showing that AP router is using a different channel than the WiFi EXTENDER on 5GHZ frequency" width="800" height="1650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1pkpce7cpevsaisl490.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1pkpce7cpevsaisl490.jpg" alt="picture of WiFi analyzer showing that AP router is using a different channel than the WiFi EXTENDER on 2.4GHZ frequency" width="800" height="1651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Takeaways
&lt;/h3&gt;

&lt;p&gt;By adding a few pieces of networking equipment, you can get an internet connection over WiFi while still providing a wired connection between devices and keeping network-heavy traffic off of the air.&lt;/p&gt;

&lt;p&gt;You can connect a WiFi extender to the SOHO network to convert its WiFi signal into a wired ethernet output. From that port, you can then set up a home lab network with OPNsense that provides a fast wired connection between the devices in your home lab.&lt;/p&gt;

&lt;p&gt;If you need to connect wireless devices to your home lab network, you can set up an AP that forwards their traffic to your switch. You can then minimize interference from the close proximity between the AP and the WiFi extender by configuring the two devices to use different WiFi channels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;After finishing this setup, here is what the internet speed looks like on a device using the SOHO WiFi:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F040cn55x3yb7kj28hrt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F040cn55x3yb7kj28hrt5.png" alt="a picture of an internet speed test from a computer connected to the home network" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;78 Mbps, pretty good.&lt;/p&gt;

&lt;p&gt;Here is my internet speed using the wired connection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr377w56cq5gihapwz5q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr377w56cq5gihapwz5q6.png" alt="picture of a speed test from a device that is wired to the switch" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;73.92 Mbps on the wire. While this is slower than using the SOHO router's WiFi directly, all of the internal traffic between home lab devices stays off the WiFi, so I'm content with a ~5 Mbps drop in speed.&lt;/p&gt;

&lt;p&gt;Here is my internet speed from my wireless device connected to the AP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nkeppbjuhdtbjcwqpst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nkeppbjuhdtbjcwqpst.jpg" alt="picture of my wireless device's internet speed test result" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;54 Mbps on wireless devices going through the AP. The speed drop and increased latency make sense here, as I'm using an older router and each packet has to hop across two different WiFi connections and three extra network devices.&lt;/p&gt;

&lt;h3&gt;
  
  
  See some improvements?
&lt;/h3&gt;

&lt;p&gt;If you've tried this setup, feel free to let me know your results. I'm only getting started with the home lab hobby, so if you have any suggestions for improvements, I'd appreciate any feedback in the comments.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>linux</category>
      <category>tutorial</category>
      <category>homelab</category>
    </item>
    <item>
      <title>How to Edit the Sudoers File in NixOS - with Examples</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Mon, 24 Mar 2025 00:03:54 +0000</pubDate>
      <link>https://dev.to/patimapoochai/how-to-edit-the-sudoers-file-in-nixos-with-examples-4k34</link>
      <guid>https://dev.to/patimapoochai/how-to-edit-the-sudoers-file-in-nixos-with-examples-4k34</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;I wanted to configure the &lt;code&gt;/etc/sudoers&lt;/code&gt; file in NixOS to set up an account that doesn't require a password for sudo, for Ansible management. However, the &lt;a href="https://wiki.nixos.org/wiki/Sudo" rel="noopener noreferrer"&gt;wiki page for sudo&lt;/a&gt; is a bit lacking, so here's everything I know about managing the sudoers file, gathered from trial and error and reading other sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basics
&lt;/h2&gt;

&lt;p&gt;If you don't know how to edit the sudoers file normally, I recommend you read &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-edit-the-sudoers-file" rel="noopener noreferrer"&gt;DigitalOcean's guide&lt;/a&gt; first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Boilerplate
&lt;/h3&gt;

&lt;p&gt;Here is how you would write the boilerplate to manage the sudoers file. You should put these lines in your &lt;code&gt;configuration.nix&lt;/code&gt; file or in a Nix module that is imported into &lt;code&gt;configuration.nix&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}:&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;# place top level options (like wheelNeedPassword) here&lt;/span&gt;
        &lt;span class="nv"&gt;enable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# make sure to enable the sudo package&lt;/span&gt;
        &lt;span class="nv"&gt;execWheelOnly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nv"&gt;wheelNeedsPassword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nv"&gt;extraConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"#includedir /etc/sudoers.d"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# write custom config in here&lt;/span&gt;

        &lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="c"&gt;# place sudoers rules here&lt;/span&gt;
        &lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c"&gt;# place other configurations outside of the sudo package here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You manage sudoers by setting the configurations inside the &lt;code&gt;security.sudo&lt;/code&gt; module.&lt;/li&gt;
&lt;li&gt;You put all of the sudoers rules in the &lt;code&gt;extraRules&lt;/code&gt; property (there is no defaultRules property).&lt;/li&gt;
&lt;li&gt;You can set other options, like disabling the password prompt for the wheel group, outside of the &lt;code&gt;extraRules&lt;/code&gt; property.&lt;/li&gt;
&lt;li&gt;You can view all possible options for the modules using the man page by running &lt;code&gt;man configuration.nix&lt;/code&gt; and searching for &lt;code&gt;security.sudo&lt;/code&gt;. You can also view the man page online on &lt;a href="https://www.mankier.com/5/configuration.nix" rel="noopener noreferrer"&gt;mankier.com&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  extraRules Template
&lt;/h3&gt;

&lt;p&gt;Here is a template of how you'd write a sudoers rule inside the &lt;code&gt;extraRules&lt;/code&gt; property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# apply this rule to this user&lt;/span&gt;
        &lt;span class="c"&gt;# groups = [ "wheel" ]; # replace the line above with this line to apply the rule to groups&lt;/span&gt;
        &lt;span class="nv"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# host portion of ALL=(ALL:ALL) (i.e. the "ALL=" part), optional&lt;/span&gt;
        &lt;span class="nv"&gt;runAs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL:ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# the "(ALL:ALL)" part in ALL=(ALL:ALL), optional&lt;/span&gt;

        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="c"&gt;# takes in a list of commands&lt;/span&gt;
          &lt;span class="s2"&gt;"/run/wrappers/bin/passwd"&lt;/span&gt; &lt;span class="c"&gt;# you can write the commands as only a string&lt;/span&gt;

          &lt;span class="c"&gt;# or write more complex commands uses an attribute set&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# this would be NOPASSWD: ALL&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"NOPASSWD"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# don't need the ":" at the end&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt; 
        &lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; 
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Translating from normal configuration to NixOS sudoers
&lt;/h3&gt;

&lt;p&gt;Here is how the normal sudoers rules can be translated into the NixOS configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukxp76xvdv9larivdh2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukxp76xvdv9larivdh2l.png" alt="Colorcoded diagram showing how parts of a normal sudoers rules can be translated into NixOS's configuration language" width="761" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;users&lt;/code&gt; field accepts a list of usernames as strings.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;commands&lt;/code&gt; field also accepts a list of commands as strings, and it will transform the list into a single line delimited by commas.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;runAs&lt;/code&gt; field doesn't require parentheses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Applying the configuration
&lt;/h3&gt;

&lt;p&gt;After writing the NixOS configuration, there are two ways to apply it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply the sudo rules to the system temporarily
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nixos-rebuild &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Permanently apply the sudo rules to the system
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nixos-rebuild switch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then test your configuration with these commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch to the user account
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;su - USERNAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;List the permissions that are assigned to the user with
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common tasks with examples
&lt;/h2&gt;

&lt;p&gt;Here are some common sudoers configurations and how you can write them in NixOS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make a user a member of the wheel group (fastest way to grant privileges)
&lt;/h3&gt;

&lt;p&gt;First, create the &lt;code&gt;sudoers-example&lt;/code&gt; user and add it to the &lt;code&gt;wheel&lt;/code&gt; group, equivalent to &lt;code&gt;usermod -aG wheel sudoers-example&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;users&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;users&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudoers-example&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;isNormalUser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;createHome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;extraGroups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"wheel"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# add into wheel&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add the wheel group and give it root privileges, equivalent to &lt;code&gt;%wheel ALL=(ALL) ALL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;enable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nv"&gt;groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"wheel"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
                &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't add &lt;code&gt;host = "ALL";&lt;/code&gt; and &lt;code&gt;runAs = "ALL:ALL";&lt;/code&gt;, NixOS sets the &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;runAs&lt;/code&gt; portions to &lt;code&gt;ALL=(ALL:ALL)&lt;/code&gt; by default.&lt;/p&gt;
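
&lt;p&gt;As a sketch, here is the wheel rule from the previous section with those defaults written out explicitly; the behavior is identical to the shorter form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;extraRules = [
    {
        groups = [ "wheel" ];
        host = "ALL";      # the "ALL=" part (default)
        runAs = "ALL:ALL"; # the "(ALL:ALL)" part (default)
        commands = [ "ALL" ];
    }
];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
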

&lt;h3&gt;
  
  
  Fastest way to make the wheel group not prompt for a password
&lt;/h3&gt;

&lt;p&gt;The fastest way to make the sudo command work without a password is to assign the user to the &lt;code&gt;wheel&lt;/code&gt; group and set the &lt;code&gt;security.sudo.wheelNeedsPassword&lt;/code&gt; property to &lt;code&gt;false&lt;/code&gt;. I found this property on the &lt;a href="https://discourse.nixos.org/t/dont-prompt-a-user-for-the-sudo-password/9163" rel="noopener noreferrer"&gt;NixOS forum&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# remember the top level options?&lt;/span&gt;
    &lt;span class="nv"&gt;wheelNeedsPassword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Short configuration to allow a user to run all commands as root
&lt;/h3&gt;

&lt;p&gt;Equivalent to &lt;code&gt;sudoers-example ALL=(ALL:ALL) ALL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Allow users in certain groups to run all commands as root
&lt;/h3&gt;

&lt;p&gt;This is similar to the rule above, but uses the &lt;code&gt;groups&lt;/code&gt; property instead of &lt;code&gt;users&lt;/code&gt;. Equivalent to &lt;code&gt;%administrator ALL=(ALL:ALL) ALL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"administrator"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Allow a user to use sudo for a specific list of commands
&lt;/h3&gt;

&lt;p&gt;Equivalent to &lt;code&gt;sudoers-example ALL=/usr/bin/useradd, /usr/bin/passwd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You cannot use the usual Linux paths for commands, like &lt;code&gt;/usr/bin/useradd&lt;/code&gt; for &lt;code&gt;useradd&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;This is because NixOS stores packages in an alternate location called the Nix store, so you have to use each command's path from that store. A quick and dirty workaround is to run &lt;code&gt;which COMMAND&lt;/code&gt; first to get the command's path on NixOS.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="p"&gt;$&lt;/span&gt; &lt;span class="nv"&gt;which&lt;/span&gt; &lt;span class="nv"&gt;passwd&lt;/span&gt;
&lt;span class="sx"&gt;/run/wrappers/bin/passwd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/run/current-system/sw/bin/useradd"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/run/wrappers/bin/passwd"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
        &lt;span class="p"&gt;];&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exclude specific commands
&lt;/h3&gt;

&lt;p&gt;This configuration allows the user to change the passwords of all users but restricts them from changing the root user's password, equivalent to &lt;code&gt;sudoers-example ALL=/usr/bin/passwd, ! /usr/bin/passwd root&lt;/code&gt;. Remember to run &lt;code&gt;which COMMAND&lt;/code&gt; first to find the path of the command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/run/wrappers/bin/passwd"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# you can run passwd on any user&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"! /run/wrappers/bin/passwd root"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# but can't run passwd on root&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
        &lt;span class="p"&gt;];&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Allow a user to run all commands without a password
&lt;/h3&gt;

&lt;p&gt;Equivalent to &lt;code&gt;sudoers-example ALL=(ALL:ALL) NOPASSWD: ALL&lt;/code&gt;. Notice how the tag_spec name (&lt;code&gt;NOPASSWD&lt;/code&gt;) doesn't require a &lt;code&gt;:&lt;/code&gt; at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; 
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"NOPASSWD"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# don't need the ":" at the end &lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
        &lt;span class="p"&gt;];&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Require a password for all commands, but no password for certain commands
&lt;/h3&gt;

&lt;p&gt;Equivalent to &lt;code&gt;sudoers-example ALL=(ALL:ALL) PASSWD: ALL, NOPASSWD: /usr/sbin/modprobe&lt;/code&gt;. The user needs to enter their password for all commands except &lt;code&gt;modprobe&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# applies the first column of the sudoers line&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"PASSWD"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/run/current-system/sw/bin/modprobe"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;# allow loading and unloading of kernel modules&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"NOPASSWD"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
        &lt;span class="p"&gt;];&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prevent commands from spawning subcommands
&lt;/h3&gt;

&lt;p&gt;A user can bypass sudo's restrictions by running an allowed command and having it spawn a subcommand that inherits root privileges, even if sudo would have blocked that subcommand directly. For example, as shown in the &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-edit-the-sudoers-file" rel="noopener noreferrer"&gt;DigitalOcean article&lt;/a&gt;, you can run &lt;code&gt;less&lt;/code&gt; with sudo and then spawn a bash shell within it that has root privileges.&lt;/p&gt;

&lt;p&gt;You can prevent users from spawning subcommands using the &lt;code&gt;NOEXEC&lt;/code&gt; tag_spec in sudo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"sudoers-example"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/run/current-system/sw/bin/less"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"NOEXEC"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# apply a tag_spec that prevent spawning child processes&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;   
        &lt;span class="p"&gt;];&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can't execute other commands in &lt;code&gt;less&lt;/code&gt; by typing &lt;code&gt;! COMMAND&lt;/code&gt;.&lt;/p&gt;
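
&lt;p&gt;For reference, the sudoers line this rule generates should look roughly like the following (assuming the module's default run-as specification):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudoers-example ALL=(ALL:ALL) NOEXEC: /run/current-system/sw/bin/less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;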

&lt;h3&gt;
  
  
  Create sudoers aliases for user groups, commands, and run-as
&lt;/h3&gt;

&lt;p&gt;Aliases are a sudo feature similar to a variable: a single name that refers to a list of items. There's no NixOS property that specifically sets the &lt;code&gt;User_Alias&lt;/code&gt;, &lt;code&gt;Cmnd_Alias&lt;/code&gt;, or &lt;code&gt;Runas_Alias&lt;/code&gt; aliases, but you can use the &lt;code&gt;extraConfig&lt;/code&gt; property to define them as custom text. NixOS will then append the lines from this property to the sudoers file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;enable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nv"&gt;extraConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="s2"&gt;''  &lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    User_Alias    ADMINGROUP = sudoers-example # define aliasses here&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    ''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
      &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ADMINGROUP"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c"&gt;# will resolve to sudoers-example&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
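
&lt;p&gt;With this configuration, the generated sudoers file should contain something like the following (the exact run-as defaults depend on the module version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User_Alias    ADMINGROUP = sudoers-example
ADMINGROUP ALL=(ALL:ALL) ALL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;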



&lt;p&gt;Other aliases should work too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;    &lt;span class="nv"&gt;extraConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="s2"&gt;''  &lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    User_Alias    GROUP = user1, user2&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    Cmnd_Alias    KERNEL = /run/current-system/sw/bin/modprobe, /run/current-system/sw/bin/modinfo&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    Runas_Alias   VIRT = kvm&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    ''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set other settings of the sudoers file
&lt;/h3&gt;

&lt;p&gt;If you want to add custom configurations that aren't implemented in NixOS's sudo module, you can also use the &lt;code&gt;extraConfig&lt;/code&gt; property. For example, if you want to add &lt;code&gt;/etc/sudoers.d&lt;/code&gt; as a drop-in directory where sudo will search for extra configuration files, you can add a multi-line string in the normal sudoers configuration language to the &lt;code&gt;extraConfig&lt;/code&gt; property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;extraConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="s2"&gt;''  &lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    #includedir /etc/sudoers.d&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;    ''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Configuring sudo in NixOS might be confusing at first, but you can master the process easily if you practice writing a few sudoers rules and reference the man page of the &lt;code&gt;configuration.nix&lt;/code&gt; file by running &lt;code&gt;man configuration.nix&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Going back to the purpose of this blog post, we can now write the &lt;code&gt;configuration.nix&lt;/code&gt; file to create a user called &lt;code&gt;ansible&lt;/code&gt; and allow this user to use sudo without asking for the password like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nix"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pkgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}:&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;# create "ansible" user&lt;/span&gt;
  &lt;span class="nv"&gt;users&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;users&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;ansible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;isNormalUser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;home&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/home/ansible"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;openssh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;authorizedKeys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ssh-rsa PUBLICKEY"&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="c"&gt;# set up sudo to not ask for a password&lt;/span&gt;
  &lt;span class="nv"&gt;security&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;sudo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="nv"&gt;enable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;extraRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
      &lt;span class="p"&gt;{&lt;/span&gt;   
        &lt;span class="nv"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"ansible"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
          &lt;span class="p"&gt;{&lt;/span&gt;   
            &lt;span class="nv"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nv"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"NOPASSWD"&lt;/span&gt; &lt;span class="p"&gt;];&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;];&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
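
&lt;p&gt;For reference, this configuration should produce a sudoers rule roughly equivalent to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ansible ALL=(ALL:ALL) NOPASSWD: ALL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;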



</description>
      <category>linux</category>
      <category>nixos</category>
      <category>tutorial</category>
      <category>tooling</category>
    </item>
    <item>
      <title>How to fix Prometheus "open /prometheus/queries.active: permission denied" on Kubernetes: step-by-step</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Mon, 17 Mar 2025 04:34:08 +0000</pubDate>
      <link>https://dev.to/patimapoochai/how-to-fix-prometheus-open-prometheusqueriesactive-permission-denied-on-kubernetes-22gf</link>
      <guid>https://dev.to/patimapoochai/how-to-fix-prometheus-open-prometheusqueriesactive-permission-denied-on-kubernetes-22gf</guid>
      <description>&lt;p&gt;I learned how to diagnose a new Prometheus + Kubernetes issue today, here's a summary of what I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;I'm trying to install the Prometheus monitoring tool to my k3s cluster using a Helm chart, and I want to store the metrics data in a volume that is mounted to the &lt;code&gt;/prometheus&lt;/code&gt; directory inside the container. I created a volume that is mounted locally to the &lt;code&gt;/home/ansible/prometheus/data&lt;/code&gt; directory on the host machine using the &lt;code&gt;rancher.io/local-path&lt;/code&gt; storage class.&lt;/p&gt;

&lt;p&gt;The PersistentVolume and Deployment manifest would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolume&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-path&lt;/span&gt;
&lt;span class="na"&gt;   local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;         path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/ansible/prometheus/data&lt;/span&gt; &lt;span class="c1"&gt;# directory that is mounted on the host&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.image.repository&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;/prometheus:{{ .Chart.AppVersion }}&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/prometheus&lt;/span&gt; &lt;span class="c1"&gt;# the volume will be mounted here in the container&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus-pvc&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These manifests create a volume that stores the data on the host computer running k3s, and mount that volume to the &lt;code&gt;/prometheus&lt;/code&gt; path inside the container.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The Helm chart installed without an issue, but I got this error when checking the status of the Prometheus pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME                        READY   STATUS             RESTARTS       AGE&lt;span class="se"&gt;\&lt;/span&gt;
heimdall-788f5f64c8-mh2lb   1/1     Running            1 &lt;span class="o"&gt;(&lt;/span&gt;4d4h ago&lt;span class="o"&gt;)&lt;/span&gt;   4d4h&lt;span class="se"&gt;\&lt;/span&gt;
prom-tst-6885c4dc8f-kzzmc   0/1     CrashLoopBackOff   4 &lt;span class="o"&gt;(&lt;/span&gt;56s ago&lt;span class="o"&gt;)&lt;/span&gt;    2m19s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I looked at the logs of the pod:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b0aqv3jrm4zqqr8bspq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b0aqv3jrm4zqqr8bspq.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What caused the pod to be in the CrashLoopBackOff state was the error: &lt;code&gt;open /prometheus/queries.active: permission denied&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It seems Prometheus couldn't create the files it needs in the container's &lt;code&gt;/prometheus&lt;/code&gt; directory because it lacked the necessary permissions. This reminded me of a section from an RHCSA book explaining that you can't bind mount a host directory into a container unless it has the correct permissions and an owner whose UID matches the user inside the container. Since Prometheus' volume is backed by a local directory on the host machine, that could be why Prometheus is getting this error.&lt;/p&gt;

&lt;p&gt;The book recommended that I run &lt;code&gt;podman unshare&lt;/code&gt; to show the UID of the user inside the container, and set the owner of the directory on the host to have the same UID, but I don't know the equivalent command in Docker to get the container user's UID.&lt;/p&gt;

&lt;p&gt;I tried looking through Prometheus' &lt;a href="https://github.com/prometheus/prometheus/blob/b0227d1f16ea5da448f7a610ed9a7e22e6f35782/Dockerfile#L17" rel="noopener noreferrer"&gt;Dockerfile on Github&lt;/a&gt; to see if I could find the UID of the container user somewhere, but I found something else instead:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwibiej0vwc3ytx7ucvfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwibiej0vwc3ytx7ucvfg.png" alt="Dockerfile content showing that the user inside the container is called nobody" width="723" height="93"&gt;&lt;/a&gt;&lt;/p&gt;
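
&lt;p&gt;The relevant part of the Dockerfile looks roughly like this (paraphrased from the linked source, not an exact copy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN chown -R nobody:nobody /etc/prometheus /prometheus
USER       nobody
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;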

&lt;p&gt;Looks like the user account of the container is set to &lt;code&gt;nobody&lt;/code&gt;, and the &lt;code&gt;/prometheus&lt;/code&gt; directory inside the container is also owned by &lt;code&gt;nobody&lt;/code&gt;. Maybe the container expects that directory to still be owned by &lt;code&gt;nobody&lt;/code&gt; when it's being used, but who owns that directory on the host machine now?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;ansible@nextcloud prometheus]&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt;
total 4
drwxr-xr-x.  3 ansible ansible   18 Mar 13 19:08 &lt;span class="nb"&gt;.&lt;/span&gt;
drwx------. 13 ansible ansible 4096 Mar 13 19:08 ..
drwxr-xr-x.  4 ansible ansible   70 Mar 13 19:15 data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the &lt;code&gt;/home/ansible/prometheus&lt;/code&gt; directory, the &lt;code&gt;data&lt;/code&gt; directory is owned by the user &lt;code&gt;ansible&lt;/code&gt;. Just a hunch, but I can try changing the owner and group of the directory on the host machine to &lt;code&gt;nobody&lt;/code&gt;, then removing the failing pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;ansible@nextcloud prometheus]&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; nobody:nobody data
&lt;span class="o"&gt;[&lt;/span&gt;ansible@nextcloud prometheus]&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt;
total 4
drwxr-xr-x.  3 ansible ansible   18 Mar 13 19:08 &lt;span class="nb"&gt;.&lt;/span&gt;
drwx------. 13 ansible ansible 4096 Mar 13 19:08 ..
drwxr-xr-x.  4 nobody  nobody    70 Mar 13 19:15 data
&lt;span class="o"&gt;[&lt;/span&gt;ansible@nextcloud prometheus]&lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;localhost@computer ~/P/Homelab&amp;gt; k get po
NAME                        READY   STATUS             RESTARTS        AGE
heimdall-788f5f64c8-mh2lb   1/1     Running            1 &lt;span class="o"&gt;(&lt;/span&gt;4d4h ago&lt;span class="o"&gt;)&lt;/span&gt;    4d4h
prom-tst-6885c4dc8f-kzzmc   0/1     CrashLoopBackOff   8 &lt;span class="o"&gt;(&lt;/span&gt;4m28s ago&lt;span class="o"&gt;)&lt;/span&gt;   20m
localhost@computer ~/P/Homelab&amp;gt; k delete po prom-tst-6885c4dc8f-kzzmc  
pod &lt;span class="s2"&gt;"prom-tst-6885c4dc8f-kzzmc"&lt;/span&gt; deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voila, that configuration seems to be what Prometheus needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;localstoat@thinkpad-e495 ~/P/Homelab&amp;gt; k get po
NAME                        READY   STATUS    RESTARTS       AGE
heimdall-788f5f64c8-mh2lb   1/1     Running   1 &lt;span class="o"&gt;(&lt;/span&gt;4d4h ago&lt;span class="o"&gt;)&lt;/span&gt;   4d4h
prom-tst-6885c4dc8f-hghv7   1/1     Running   0              68s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Prometheus is now reachable at the host's FQDN on the service's NodePort:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vzqkgxt7x4wpaqqafwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vzqkgxt7x4wpaqqafwj.png" alt="Prometheus web UI is now accessible" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you're getting a &lt;code&gt;permission denied&lt;/code&gt; error on a Kubernetes deployment that uses a local-path storage class, make sure that &lt;strong&gt;the directory backing the PersistentVolume is owned by the same user as the one inside the container&lt;/strong&gt; and has the necessary permissions. Otherwise, the application will keep getting &lt;code&gt;permission denied&lt;/code&gt; errors when writing to that directory.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rancher.io/local-path&lt;/code&gt; really just uses a directory bind mount under the hood, so the Linux administration best practices for making bind mounts work correctly on the host still apply in Kubernetes.&lt;/li&gt;
&lt;li&gt;You can look inside the service's Dockerfile to get information about the user and directory permissions of the container.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extra: codifying this troubleshooting into configuration management
&lt;/h2&gt;

&lt;p&gt;To make this whole troubleshooting journey worth it, I modified my Ansible playbook that is used to deploy the Prometheus Helm chart to automatically set the directory owner to be &lt;code&gt;nobody&lt;/code&gt;. If I have to reinstall the Prometheus Helm chart again, my configuration management tool will apply this fix automatically without manual intervention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup prometheus with helm&lt;/span&gt;
&lt;span class="na"&gt; hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rocky,&amp;amp;prometheus&lt;/span&gt;
&lt;span class="na"&gt; tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt; - name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;add directory for bind mount volumes&lt;/span&gt;
&lt;span class="na"&gt;   file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;     path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/ansible/prometheus/data&lt;/span&gt;
&lt;span class="na"&gt;     owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nobody&lt;/span&gt; &lt;span class="c1"&gt;# added to set directory owner to be the "nobody" account&lt;/span&gt;
&lt;span class="na"&gt;     group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nobody&lt;/span&gt; &lt;span class="c1"&gt;# added to set group owner to be the "nobody" account&lt;/span&gt;
&lt;span class="na"&gt;     state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What caused this
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The directory used by the PersistentVolume of "rancher.io/local-path" storage class doesn't have the same owner as the user in the Prometheus container.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The user in the container is restricted from reading and writing files in the directory on the host machine that stores the data for the local-path volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to fix it
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;chown&lt;/code&gt; on the host machine to change the owner of the directory to whatever the user inside the container is; in this case, it's "nobody".
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chmod&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &amp;lt;DirectoryOfTheVolume&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Delete the failing pod (and recreate it if you're not using a Deployment)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod &amp;lt;PodName&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>kubernetes</category>
      <category>linux</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>Creating an AWS + NextJS site for the Cloud Resume Challenge</title>
      <dc:creator>Patima Poochai</dc:creator>
      <pubDate>Tue, 24 Dec 2024 04:02:18 +0000</pubDate>
      <link>https://dev.to/patimapoochai/creating-a-nextjs-aws-site-for-the-cloud-resume-challenge-5121</link>
      <guid>https://dev.to/patimapoochai/creating-a-nextjs-aws-site-for-the-cloud-resume-challenge-5121</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5coe389gvl5g65gwl7od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5coe389gvl5g65gwl7od.png" alt="Title card of the project with the text " width="780" height="520"&gt;&lt;/a&gt;&lt;br&gt;
I've recently completed the &lt;a href="https://cloudresumechallenge.dev/docs/the-challenge/aws/" rel="noopener noreferrer"&gt;Cloud Resume Challenge&lt;/a&gt; created by Forrest Brazeal. The challenge involves building a resume website using modern cloud technologies by completing a set of challenges called "Chunks". For my version, I created a static resume site in NextJS that is hosted on S3 and Cloudfront with a user count tracking feature that is implemented with Lambda and DynamoDB. I also applied modern DevOps principles by deploying the infrastructure as code with Terraform, creating a CI/CD pipeline using GitHub Actions, and performing end-to-end testing with Cypress. Working through this challenge felt like a breath of fresh air. While I’ve worked on many web projects throughout my college years, I haven’t gained as much insight and new perspectives on software development as I did from working through this challenge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk 1: The Front End
&lt;/h2&gt;

&lt;p&gt;The Cloud Resume Challenge is divided into four chunks, each containing a set of steps to build one component of the project. I started with the first chunk by building the front end of the resume website in NextJS. The challenge only requires a basic HTML/CSS website, but I chose NextJS because, having worked on web projects in the past, I knew a pure HTML/CSS site would be more difficult to maintain than one built on a production-grade framework. I also decided on this framework for its flexibility in creating both server-side and static websites, and because I wanted to gain experience with it for future projects. Finally, I chose to implement the frontend components as Infrastructure as Code (IaC) with Terraform earlier than required. This decision made learning the tool easier, as it let me write IaC against a simpler architecture, and it saved me time when I had to convert the entire project to IaC later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd55iodequfm28n702vb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd55iodequfm28n702vb6.png" alt="The resume website side-by-side with the source code written in Javascript" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, I deployed the site to AWS using an S3 bucket and cached the website on the CloudFront CDN. I also secured web requests to the site by implementing HTTPS with AWS Certificate Manager. So far, this was nothing too different from my past projects, but as I moved on to the next chunks, my past experience would become less and less useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk 2: The Back End
&lt;/h2&gt;

&lt;p&gt;The main feature of chunks 2 and 3 is the visitor counter. The challenge requires a persistent counter that shows the number of visitors who have viewed the page and updates each time the page is refreshed. I began working on the backend of this feature in chunk 2 by writing a Lambda function in Python that stores and updates the count as a statistics record in DynamoDB each time a visitor views the site. The Lambda function is invoked by HTTP POST calls to a public AWS API Gateway endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vnqf0x5cb25wn8p35ls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vnqf0x5cb25wn8p35ls.png" alt="Diagram of the visitor count widget sending data to the AWS back end services" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also added the extra functionality of a counter that displays the number of unique visitors to the site. A flaw with the regular visitor counter is that it increments on every refresh, so a user can inflate the number by refreshing the page repeatedly. I created a second counter that tracks unique visitors by caching each visitor's IP address and counting a user as unique only if their IP address is not already in the cache. I also preserve each visitor's anonymity by storing the IP addresses as hashes in the database. This extra interactivity can make visitors feel more welcome by having the website accurately recall their previous interactions with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kcfh1q3glxf78tns1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kcfh1q3glxf78tns1y.png" alt="Step-by-step diagram of how the unique visitor count widget stores identify the unique IP of each user" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk 3: Integrating Both Ends
&lt;/h2&gt;

&lt;p&gt;While the backend infrastructure of the visitor counter was completed in the previous chunk, I still needed the website to retrieve the data from the backend. My work in chunk 3 centered on making the counter display the correct number of visitors and on writing smoke tests for both the front end and the back end. I made the website query the visitor count by using JavaScript to send HTTP POST requests to the API Gateway endpoints.&lt;/p&gt;

&lt;p&gt;However, new challenges arose when I had to write tests for the website. In my past college projects, quality wasn't as important as submitting the project: as long as it was done enough and handed in on time, it counted as “completed,” even if it had a high chance of being non-functional. With this focus on getting things out the door rather than on functionality, I never considered writing tests. I always tested functionality by hand, which led me to implement incomplete and broken features that required more patches later. In the worst cases, those patches, verified only by more manual testing, introduced further issues that required still more patches down the road. Having worked on many projects where quality didn't matter, I assumed testing was a waste of time that could be better spent troubleshooting issues as they appeared.&lt;/p&gt;

&lt;p&gt;My mindset began to change when chunk 3 required me to write tests for the API Gateway. The challenge calls for "smoke tests": end-to-end tests, written in Cypress, that exercise the full functionality of the website. I had to think about what could go wrong in my code, such as how an HTTP request might contain malformed headers, or what happens when the Lambda code increments the visitor count before the database has been initialized.&lt;/p&gt;

&lt;p&gt;Thinking about the potential issues and writing tests to detect them made me realize the benefit of testing. A framework like Cypress ensures that each test runs consistently and correctly, while letting me run the full suite in a single click. It offloads the toil and prevents the errors of manual testing, giving me more time to work on the features of the website. Writing tests that ensure the website functions properly became as important as writing the code for the website itself. The focus changed from just "getting it done" to using tools that empower me to create quality work while making it easy to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7jfr8ba79w8sjp7e6zi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7jfr8ba79w8sjp7e6zi.png" alt="A Cypress GUI showing a passing test run of the resume website" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During this chunk, I also completed the task of earning an AWS certification. The challenge originally requires the AWS Certified Cloud Practitioner certificate, but I already held it, so to further my learning I decided to obtain the associate-level AWS Certified Solutions Architect certificate. In hindsight, this was a good decision: I learned how AWS services work and how they integrate with each other in more depth than the practitioner-level certificate covers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F582bp1872kyu55ogtlma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F582bp1872kyu55ogtlma.png" alt="AWS Certified Solutions Architect certificate addressed to Patima Poochai" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk 4: The Last Stretch
&lt;/h2&gt;

&lt;p&gt;On the last stretch of the challenge, I migrated the remaining backend infrastructure to IaC with Terraform. I also created a CI/CD pipeline using GitHub Actions to manage the code hosted on GitHub. GitHub Actions deploys the code to AWS and runs the end-to-end tests automatically whenever new code is committed to either repository. This solved another issue common to my past projects: whenever someone committed broken code, it took a long time to identify what was broken and to track down which commit caused it. There were even times when people submitted more broken commits while I was still troubleshooting the initial issue. A CI/CD system that automatically checked each commit and flagged the one that broke the build would have saved me many sleepless nights spent troubleshooting unknown errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhd7d6lmn49kl9yh43sd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhd7d6lmn49kl9yh43sd.png" alt="A successful run of the Github Actions" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One major problem in this chunk was troubleshooting incorrect permissions in the IAM policy for GitHub Actions. The pipeline needed an AWS role with permission to access every service the Terraform deployment touches, and it was unclear which permissions were needed across seven different AWS services. At first, I figured I could attempt a deployment, read the errors Terraform outputs, and manually add the missing permissions to the IAM policy. However, this was time-consuming: each error message shows only one missing permission out of possibly hundreds, and each pipeline run took around 10 minutes to attempt a deployment and surface its errors. I could have kept deploying and adding permissions one by one, as I would have in my past projects, but with how much this project had already challenged my thinking, I wanted to see if there was a better way to solve the problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bf4mp3qrrztxoi7f4zc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bf4mp3qrrztxoi7f4zc.png" alt="An error message that describes the missing permissions needed to run Terraform deployment to AWS" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first, I tried capturing the needed permissions with AWS CloudTrail, the built-in API call logging tool. I used Terraform to deploy the website to a separate test account and logged the API calls it made, planning to use their names to write the IAM permissions policy. However, this didn't work: API call names aren't always valid permission names in an IAM policy, and some required permissions were missing from the logs entirely. The challenge didn't recommend any way around this problem, so I was uncertain whether I'd find one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9w845236ysxa01o9yai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9w845236ysxa01o9yai.png" alt="AWS Cloudtrail showing the logged API calls that was used by Terraform" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But then I found a tool called “&lt;a href="https://github.com/iann0036/iamlive" rel="noopener noreferrer"&gt;iamlive&lt;/a&gt;”. It works by creating an HTTP proxy on your local machine, capturing the API calls Terraform makes through the proxy, and writing them out as a text file formatted as an IAM policy. At first, I wasn't confident the tool would work, as I couldn't get the proxy running with the recommended command-line arguments. After spending some time figuring out which arguments my setup needed, I got it to capture the IAM permissions. With this tool, I finished the chunk by using the captured permissions to write IAM policies that let GitHub Actions deploy the Terraform infrastructure to AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkxa11o0kkey7qt0nzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkxa11o0kkey7qt0nzl.png" alt="Live output of the permissions when running Terraform through iamlive's proxy" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By taking on this challenge, I learned a lot about the tools used in cloud operations and about modern DevOps principles. I sped up my infrastructure deployment with Terraform, removed the toil of manual testing by writing end-to-end tests in Cypress, and minimized time spent troubleshooting with a CI/CD pipeline built on GitHub Actions. I've also learned that, by being willing to break away from the bad habits built up in my past projects, I can change my workflow for the better.&lt;/p&gt;

&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;p&gt;View the resume site &lt;a href="https://resume.patimapoochai.net/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
Check out the project code below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/patimapoochai/cloud-resume-challenge" rel="noopener noreferrer"&gt;Front End Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/patimapoochai/cloud-resume-backend" rel="noopener noreferrer"&gt;Back End Source Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>webdev</category>
      <category>serverless</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
