<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ces0712</title>
    <description>The latest articles on DEV Community by ces0712 (@ces0712).</description>
    <link>https://dev.to/ces0712</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889790%2F49c35916-1545-4c64-b7d0-8b8e122026fd.png</url>
      <title>DEV Community: ces0712</title>
      <link>https://dev.to/ces0712</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ces0712"/>
    <language>en</language>
    <item>
      <title>Code In, Cluster Out: Building Reproducible Edge Kubernetes with NixOS, K3s, and Forgejo</title>
      <dc:creator>ces0712</dc:creator>
      <pubDate>Sat, 25 Apr 2026 18:09:23 +0000</pubDate>
      <link>https://dev.to/ces0712/code-in-cluster-out-building-reproducible-edge-kubernetes-with-nixos-k3s-and-forgejo-i08</link>
      <guid>https://dev.to/ces0712/code-in-cluster-out-building-reproducible-edge-kubernetes-with-nixos-k3s-and-forgejo-i08</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1txvoal9k11f3s3jxyne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1txvoal9k11f3s3jxyne.png" alt="Cover" width="396" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if your entire Kubernetes edge cluster, from the kernel to the workload, was a single reproducible function?&lt;/p&gt;

&lt;p&gt;No drift. No snowflakes. No "this node got fixed manually six months ago, and nobody remembers how." Just code in, cluster out.&lt;/p&gt;

&lt;p&gt;That question led me into a project that combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;infrastructure-nixos&lt;/code&gt; for the Raspberry Pi-hosted Forgejo control path&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;edge-cluster-infra&lt;/code&gt; for Oracle networking, compute, and block storage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;infrastructure-secrets&lt;/code&gt; for the shared SOPS-managed secret layer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt; for the NixOS + K3s runtime and workload layer&lt;/li&gt;
&lt;li&gt;RustDesk as a real workload proof point&lt;/li&gt;
&lt;li&gt;A Raspberry Pi-hosted Forgejo instance, a Mac mini runner, and an Oracle edge node as the deployed target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post is the practical version of that story: what I built, what actually worked, what hurt, and why I think the most interesting thing here is not Nix syntax, but where the source of truth lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  At a glance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I split the system into four repositories with clear boundaries: control plane, infrastructure, secrets, and runtime.&lt;/li&gt;
&lt;li&gt;Forgejo on a Raspberry Pi is the canonical GitOps path. GitHub is only a public push mirror.&lt;/li&gt;
&lt;li&gt;Oracle infrastructure is provisioned separately, then handed off to a NixOS + K3s runtime repo.&lt;/li&gt;
&lt;li&gt;RustDesk is the end-to-end workload proof because it forces real decisions around networking, persistence, and access.&lt;/li&gt;
&lt;li&gt;The system only started to feel trustworthy once backup and restore became visible, testable, and boring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why edge gets weird fast
&lt;/h2&gt;

&lt;p&gt;Edge environments are where nice Kubernetes assumptions go to die.&lt;/p&gt;

&lt;p&gt;They run on constrained hardware. Networks are unreliable. Access is awkward. Sometimes direct SSH is temporary, risky, or unavailable. The failure mode is not just "my pod crashed." The failure mode is "this node became special."&lt;/p&gt;

&lt;p&gt;That is what I wanted to avoid.&lt;/p&gt;

&lt;p&gt;Traditional approaches can get you pretty far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cloud-init&lt;/li&gt;
&lt;li&gt;shell scripts&lt;/li&gt;
&lt;li&gt;Ansible&lt;/li&gt;
&lt;li&gt;golden images&lt;/li&gt;
&lt;li&gt;imperative &lt;code&gt;kubeadm&lt;/code&gt; or post-install hand tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the deeper I went, the more obvious the problem became: &lt;strong&gt;cluster-level reproducibility starts too late&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the node itself is still imperative, you are already carrying drift before Kubernetes even starts.&lt;/p&gt;

&lt;p&gt;That was the core problem I wanted to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model I wanted instead
&lt;/h2&gt;

&lt;p&gt;I wanted the deployment artifact to be bigger than a container image.&lt;/p&gt;

&lt;p&gt;I wanted a declared system that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operating system state&lt;/li&gt;
&lt;li&gt;Kubernetes runtime state&lt;/li&gt;
&lt;li&gt;Workload intent&lt;/li&gt;
&lt;li&gt;Secret wiring&lt;/li&gt;
&lt;li&gt;Backup and restore readiness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where NixOS and K3s became interesting together.&lt;/p&gt;

&lt;p&gt;K3s gives me a lightweight Kubernetes distribution that fits the edge reality much better than a heavyweight control plane.&lt;/p&gt;

&lt;p&gt;NixOS gives me a declarative host where packages, services, users, storage, networking, and system behavior are all expressed as code and activated atomically.&lt;/p&gt;
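&lt;p&gt;As a sketch of what that looks like in practice (the &lt;code&gt;services.k3s&lt;/code&gt; options are real NixOS module options; the host name and package choice are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: one NixOS module declares host and runtime together,
# and the whole thing activates atomically.
{ config, pkgs, ... }:
{
  networking.hostName = "cloud-edge-1";

  # The Kubernetes runtime is just another declared service.
  services.k3s = {
    enable = true;
    role = "server";
  };

  # SSH posture and tooling come from the same source of truth.
  services.openssh.enable = true;
  environment.systemPackages = [ pkgs.k9s ];
}
&lt;/code&gt;&lt;/pre&gt;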

&lt;p&gt;The result is not "immutable infrastructure" in the buzzword sense. It is something more practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the host, the cluster, and the workload converge from the same source of truth&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;This is the deployed shape I ended up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50egpdqiyfbnyja65qjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50egpdqiyfbnyja65qjd.png" alt="System overview" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important part is the split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forgejo on Raspberry Pi with NixOS&lt;/strong&gt; is the control path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac mini runner&lt;/strong&gt; executes the workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle edge node&lt;/strong&gt; is the NixOS + K3s target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RustDesk&lt;/strong&gt; is the workload used to prove the system end-to-end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not really a story about "installing a cluster." It is a story about building an operational path that starts in Git and ends in a reproducible edge node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I split it across four repositories
&lt;/h2&gt;

&lt;p&gt;One of the more useful lessons from this work is that the system became easier to reason about when I stopped trying to force everything into a single repo.&lt;/p&gt;

&lt;p&gt;The split is intentional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;infrastructure-nixos&lt;/code&gt; owns the Raspberry Pi running Forgejo on NixOS&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;edge-cluster-infra&lt;/code&gt; owns the Oracle infrastructure only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;infrastructure-secrets&lt;/code&gt; owns the encrypted secret material&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt; owns the runtime host, K3s, apps, and validation logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gave me cleaner boundaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infra does not pretend it knows about workloads&lt;/li&gt;
&lt;li&gt;Secrets do not get buried inside runtime repos&lt;/li&gt;
&lt;li&gt;The Git control path has its own declared host&lt;/li&gt;
&lt;li&gt;The edge runtime repo can focus on host + cluster + workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, that was a better shape than one giant repo full of mixed concerns.&lt;/p&gt;

&lt;p&gt;It also made the system easier to explain to other engineers, which is usually a good sign that the boundaries are doing real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What each repository is really doing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;infrastructure-nixos&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the Raspberry Pi side of the story.&lt;/p&gt;

&lt;p&gt;It is not just "the box that runs Forgejo." It is a declared NixOS host with its own deploy, validate, backup, and restore flow. That matters because the GitOps engine is also part of the reproducible system, not an external assumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;edge-cluster-infra&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This repo owns only the Oracle layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Compute&lt;/li&gt;
&lt;li&gt;Block storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means infra planning and applying stay explicit, and the handoff to the runtime repo is intentional instead of magical.&lt;/p&gt;

&lt;p&gt;Technically, this repo is the place where I wanted all cloud-specific concerns to stop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTofu and Terramate orchestration&lt;/li&gt;
&lt;li&gt;OCI networking, storage, and compute&lt;/li&gt;
&lt;li&gt;Centralized local state&lt;/li&gt;
&lt;li&gt;Runner-local var files&lt;/li&gt;
&lt;li&gt;Pre-apply state backup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the runtime repo from turning into a hidden infrastructure repo.&lt;/p&gt;

&lt;p&gt;It also let me be explicit about something people often blur away in demos: the infrastructure state is local on purpose.&lt;/p&gt;

&lt;p&gt;For a single-operator system, I preferred a centralized local OpenTofu state path on the runner, with timestamped pre-apply archives and off-machine copies, over prematurely pretending I needed a remote backend just to look more cloud-native.&lt;/p&gt;

&lt;p&gt;That gave me a simpler operational model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One clear operator state location&lt;/li&gt;
&lt;li&gt;Automatic state backup before real infra changes&lt;/li&gt;
&lt;li&gt;An explicit restore path for the state itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a forever design for every team, but it was the right design for this phase of the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;infrastructure-secrets&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the shared SOPS repository.&lt;/p&gt;

&lt;p&gt;It keeps the secret model stable across both the Forgejo Pi and the Oracle edge node, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tailscale auth material&lt;/li&gt;
&lt;li&gt;Backup credentials&lt;/li&gt;
&lt;li&gt;Service secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That consistency ended up mattering more than I expected.&lt;/p&gt;

&lt;p&gt;It also kept the secret transport story stable across repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edit once in the SOPS repo&lt;/li&gt;
&lt;li&gt;Decrypt at deploy time through &lt;code&gt;sops-nix&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stage the age key on the target where required&lt;/li&gt;
&lt;li&gt;Reuse the same secret names and mental model between hosts&lt;/li&gt;
&lt;/ul&gt;
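&lt;p&gt;The &lt;code&gt;sops-nix&lt;/code&gt; side of that wiring can be sketched like this (the secret name and file paths are placeholders; the option names come from &lt;code&gt;sops-nix&lt;/code&gt; and the NixOS Tailscale module):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: the host consumes encrypted material from the shared SOPS repo.
{ config, ... }:
{
  # Age key staged on the target out of band.
  sops.age.keyFile = "/var/lib/sops-nix/key.txt";

  # Same secret names reused across hosts.
  sops.secrets.tailscale-authkey = {
    sopsFile = ./secrets/common.yaml;   # from infrastructure-secrets
  };

  # Decrypted at activation time, consumed by the service.
  services.tailscale.authKeyFile =
    config.sops.secrets.tailscale-authkey.path;
}
&lt;/code&gt;&lt;/pre&gt;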

&lt;h3&gt;
  
  
  &lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the runtime repo.&lt;/p&gt;

&lt;p&gt;It owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host definitions&lt;/li&gt;
&lt;li&gt;Reusable modules&lt;/li&gt;
&lt;li&gt;App declarations&lt;/li&gt;
&lt;li&gt;K3s manifests&lt;/li&gt;
&lt;li&gt;Deploy and validate scripts&lt;/li&gt;
&lt;li&gt;Backup and restore checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the edge node stops being a one-off machine and starts becoming a platform.&lt;/p&gt;

&lt;p&gt;That repo owns the behaviors that actually define the edge box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bootstrap mode&lt;/li&gt;
&lt;li&gt;Tailscale-first access&lt;/li&gt;
&lt;li&gt;K3s server enablement&lt;/li&gt;
&lt;li&gt;RustDesk manifest generation&lt;/li&gt;
&lt;li&gt;Backup validation&lt;/li&gt;
&lt;li&gt;Restore checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, this is the repo that answers the question "what is this node supposed to be?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-repo control flow
&lt;/h2&gt;

&lt;p&gt;The architecture became much easier to explain once I started treating the handoffs as first-class design elements.&lt;/p&gt;

&lt;p&gt;The actual control flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;edge-cluster-infra&lt;/code&gt; provisions Oracle resources and produces runtime handoff values.&lt;/li&gt;
&lt;li&gt;Forgejo, hosted through &lt;code&gt;infrastructure-nixos&lt;/code&gt;, stores the Git history and workflow definitions.&lt;/li&gt;
&lt;li&gt;The Mac mini runner executes the workflows using runner-local config and state.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt; consumes the host/runtime intent and converges the Oracle node through Colmena.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;infrastructure-secrets&lt;/code&gt; provides the encrypted secret layer consumed through &lt;code&gt;sops-nix&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
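&lt;p&gt;Concretely, the handoff from step 1 to step 4 is just data: the infra repo exports cloud facts, and the runtime repo reads them as plain Nix values. The attribute names and values here are illustrative, not the actual export:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: cloud facts produced by edge-cluster-infra,
# consumed by the host definition in nix-k3s-edge-cluster.
{
  hostName    = "cloud-edge-1";
  privateIp   = "10.0.1.10";
  blockDevice = "/dev/oracleoci/oraclevdb";
  region      = "eu-frankfurt-1";
}
&lt;/code&gt;&lt;/pre&gt;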

&lt;p&gt;That split gave me a system where each repo has a clear question it answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should exist in the cloud?&lt;/li&gt;
&lt;li&gt;What stores and runs the Git control plane?&lt;/li&gt;
&lt;li&gt;What secrets are allowed into the system?&lt;/li&gt;
&lt;li&gt;What should the runtime node actually do?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a better architecture conversation than "here is my monorepo."&lt;/p&gt;

&lt;p&gt;More importantly, it created &lt;strong&gt;explicit interfaces&lt;/strong&gt; between layers instead of hidden coupling.&lt;/p&gt;

&lt;p&gt;At a high level, the interfaces look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;edge-cluster-infra&lt;/code&gt; exports cloud facts

&lt;ul&gt;
&lt;li&gt;host identity&lt;/li&gt;
&lt;li&gt;private IP&lt;/li&gt;
&lt;li&gt;block storage attachment details&lt;/li&gt;
&lt;li&gt;region and subnet context&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;infrastructure-secrets&lt;/code&gt; exports secret facts

&lt;ul&gt;
&lt;li&gt;Tailscale auth material&lt;/li&gt;
&lt;li&gt;backup credentials&lt;/li&gt;
&lt;li&gt;service secrets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt; consumes those facts and turns them into runtime behavior&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;infrastructure-nixos&lt;/code&gt; provides the self-hosted Git control path that stores the automation driving everything else&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;That may sound obvious, but it changed the system's maintainability. Once the handoffs are explicit, you can reason about changes without reloading the whole stack into your head.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the runtime repo actually looks like
&lt;/h2&gt;

&lt;p&gt;I tried hard to avoid a magical repo where every concern is buried in nested abstractions.&lt;/p&gt;

&lt;p&gt;The runtime repo is intentionally legible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cccei6f2iawdbe5538g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cccei6f2iawdbe5538g.png" alt="Runtime repo structure" width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That shape matters because it keeps the boundaries visible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hosts/&lt;/code&gt; describes the target machine&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;modules/&lt;/code&gt; captures reusable NixOS behavior&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apps/&lt;/code&gt; declares workload intent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/&lt;/code&gt; handles the operational glue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the kind of structure that helps when you come back six months later and need to answer: &lt;em&gt;where does this behavior actually come from?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It also fits the repo split cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Oracle provisioning lives elsewhere&lt;/li&gt;
&lt;li&gt;Secrets live elsewhere&lt;/li&gt;
&lt;li&gt;Forgejo lives elsewhere&lt;/li&gt;
&lt;li&gt;This repo focuses on the runtime contract for the edge node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically, that contract is a composition of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;host-specific config in &lt;code&gt;hosts/cloud-edge-1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;reusable platform modules in &lt;code&gt;modules/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;app-specific Nix modules in &lt;code&gt;apps/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;operator wrappers in &lt;code&gt;scripts/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;workflows in &lt;code&gt;.forgejo/workflows/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That structure let me keep the slide-sized code excerpts honest. I was not cherry-picking from a giant, ambiguous config file. The repo actually has that shape.&lt;/p&gt;
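&lt;p&gt;In flake terms, that composition is roughly the following shape (the module file names, system type, and input wiring are assumptions for illustration, not the actual flake):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: the host is a composition of layers, each from its own directory.
nixosConfigurations.cloud-edge-1 = nixpkgs.lib.nixosSystem {
  system = "aarch64-linux";
  modules = [
    ./hosts/cloud-edge-1           # host-specific config
    ./modules/tailscale-access.nix # reusable platform behavior
    ./apps/rustdesk.nix            # workload intent
    sops-nix.nixosModules.sops     # secret wiring
  ];
};
&lt;/code&gt;&lt;/pre&gt;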

&lt;p&gt;This also improves failure analysis.&lt;/p&gt;

&lt;p&gt;When something breaks, the search space is narrower:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cloud provisioning bug?

&lt;ul&gt;
&lt;li&gt;look in &lt;code&gt;edge-cluster-infra&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;secret resolution bug?

&lt;ul&gt;
&lt;li&gt;look in &lt;code&gt;infrastructure-secrets&lt;/code&gt; and &lt;code&gt;sops-nix&lt;/code&gt; wiring&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;host convergence bug?

&lt;ul&gt;
&lt;li&gt;look in &lt;code&gt;nix-k3s-edge-cluster/modules&lt;/code&gt; or deploy scripts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Git/control-plane bug?

&lt;ul&gt;
&lt;li&gt;look in &lt;code&gt;infrastructure-nixos&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;That kind of fault isolation is hard to achieve when every concern shares the same repo and abstractions.&lt;/p&gt;

&lt;p&gt;For a platform that spans cloud provisioning, secrets, host convergence, and workload deployment, that reduction in search space is a real operational advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host declaration vs cluster declaration
&lt;/h2&gt;

&lt;p&gt;One subtle advantage of this model is that NixOS and K3s give you two distinct but adjacent declaration layers.&lt;/p&gt;

&lt;p&gt;At the host layer, I can declare things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bootstrap behavior&lt;/li&gt;
&lt;li&gt;Root login policy&lt;/li&gt;
&lt;li&gt;Tailscale access&lt;/li&gt;
&lt;li&gt;Local state directories&lt;/li&gt;
&lt;li&gt;Backup paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the Kubernetes layer, I can declare workload intent through &lt;code&gt;services.k3s.manifests&lt;/code&gt;.&lt;/p&gt;
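&lt;p&gt;A minimal example of that second layer, using a hypothetical nginx Deployment just to show the shape of &lt;code&gt;services.k3s.manifests&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: workload intent declared in Nix, rendered into the
# K3s auto-deploy directory as a manifest.
services.k3s.manifests.hello = {
  content = {
    apiVersion = "apps/v1";
    kind = "Deployment";
    metadata = { name = "hello"; namespace = "default"; };
    spec = {
      replicas = 1;
      selector.matchLabels.app = "hello";
      template = {
        metadata.labels.app = "hello";
        spec.containers = [{
          name = "hello";
          image = "nginx:1.27";
        }];
      };
    };
  };
};
&lt;/code&gt;&lt;/pre&gt;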

&lt;p&gt;In many setups, the control plane is where declarative intent starts. In this setup, declarative intent starts one layer lower:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk and filesystem assumptions&lt;/li&gt;
&lt;li&gt;Bootloader behavior&lt;/li&gt;
&lt;li&gt;SSH posture&lt;/li&gt;
&lt;li&gt;VPN access plane&lt;/li&gt;
&lt;li&gt;K3s control-plane role&lt;/li&gt;
&lt;li&gt;Workload manifests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That does not replace Kubernetes. It gives Kubernetes a more deterministic substrate to stand on.&lt;/p&gt;

&lt;p&gt;The difference is subtle, but it changes how you think about reliability. Container reproducibility matters. Host reproducibility matters too.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real workload, not a hello world
&lt;/h2&gt;

&lt;p&gt;I did not want to prove this with a toy deployment.&lt;/p&gt;

&lt;p&gt;RustDesk was a better example because it forces me to care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network exposure&lt;/li&gt;
&lt;li&gt;Host integration&lt;/li&gt;
&lt;li&gt;Persistent data&lt;/li&gt;
&lt;li&gt;Real end-to-end usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this model, workload intent is still declared in Nix and handed to K3s as manifests.&lt;/p&gt;

&lt;p&gt;That is the key idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not "Nix installs K3S."&lt;/li&gt;
&lt;li&gt;but "Nix declares what the cluster should run."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, the workload design also forced a few concrete runtime decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hostNetwork = true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;host-backed persistent state&lt;/li&gt;
&lt;li&gt;explicit RustDesk server containers&lt;/li&gt;
&lt;li&gt;a private access model through Tailscale&lt;/li&gt;
&lt;/ul&gt;
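&lt;p&gt;Those decisions show up directly in the declared workload. A trimmed sketch of the pod spec in Nix attrset form (image tag, mount paths, and state directory are illustrative, not the exact config):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: the RustDesk servers as a host-networked pod
# with host-backed persistent state.
spec = {
  hostNetwork = true;              # RustDesk wants real host ports
  containers = [
    { name = "hbbs";               # ID/rendezvous server
      image = "rustdesk/rustdesk-server:latest";
      args = [ "hbbs" ];
      volumeMounts = [{ name = "data"; mountPath = "/root"; }];
    }
    { name = "hbbr";               # relay server
      image = "rustdesk/rustdesk-server:latest";
      args = [ "hbbr" ];
      volumeMounts = [{ name = "data"; mountPath = "/root"; }];
    }
  ];
  volumes = [{
    name = "data";
    hostPath.path = "/var/lib/rustdesk";  # survives pod rebuilds
  }];
};
&lt;/code&gt;&lt;/pre&gt;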

&lt;p&gt;That is exactly why RustDesk was useful here. It is opinionated enough to reveal whether the system is real.&lt;/p&gt;

&lt;p&gt;Architecturally, it also exercised the kinds of assumptions that usually get postponed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the host state is mounted into the workload runtime&lt;/li&gt;
&lt;li&gt;Which ports and network model the workload expects&lt;/li&gt;
&lt;li&gt;Whether the access plane is public, private, or tunneled&lt;/li&gt;
&lt;li&gt;How workload identity and persistent state survive rebuilds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That made it a much better proving ground than a simple HTTP deployment would have been.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I used Colmena for runtime convergence
&lt;/h2&gt;

&lt;p&gt;I wanted runtime deployment to be obviously Nix-native.&lt;/p&gt;

&lt;p&gt;Colmena fits because it keeps the host convergence model close to the flake and module structure, rather than introducing a second orchestration abstraction for the runtime layer.&lt;/p&gt;

&lt;p&gt;That gave me a clean separation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTofu/Terramate for cloud infra&lt;/li&gt;
&lt;li&gt;Colmena for host/runtime convergence&lt;/li&gt;
&lt;li&gt;K3s manifests for workload state&lt;/li&gt;
&lt;/ul&gt;
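&lt;p&gt;The Colmena side stays close to the same module structure. A sketch of the hive (the deployment target hostname is an assumption; the &lt;code&gt;deployment&lt;/code&gt; options are real Colmena options):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: Colmena converges the host over its Tailscale address,
# reusing the same host modules as the flake.
colmena = {
  meta.nixpkgs = import nixpkgs { system = "aarch64-linux"; };

  cloud-edge-1 = {
    deployment = {
      targetHost = "cloud-edge-1.tailnet.example";  # Tailscale-first access
      targetUser = "root";
      buildOnTarget = false;   # build on the runner, push closures
    };
    imports = [ ./hosts/cloud-edge-1 ];
  };
};
&lt;/code&gt;&lt;/pre&gt;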

&lt;p&gt;I like that split because each tool owns its own distinct layer, rather than competing for the same responsibilities.&lt;/p&gt;

&lt;p&gt;That tool separation gave me a layered control model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTofu/Terramate answer: "What cloud resources should exist?"&lt;/li&gt;
&lt;li&gt;Colmena answers: "What should this host converge to?"&lt;/li&gt;
&lt;li&gt;K3s answers: "What should run inside the cluster?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because those questions stay distinct, the implementation stays more legible.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitOps path: visible, staged, and boring
&lt;/h2&gt;

&lt;p&gt;One of the best parts of the final setup is that the GitOps path is visible.&lt;/p&gt;

&lt;p&gt;Not hidden in shell history. Not living in a one-off laptop script. Not dependent on a human remembering the right sequence.&lt;/p&gt;

&lt;p&gt;The workflow surface in Forgejo looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4e02w1or8ox3x01oirj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4e02w1or8ox3x01oirj.png" alt="Forgejo workflow overview" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the infra apply path is explicit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsys7nmx6kb8766uy180.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsys7nmx6kb8766uy180.png" alt="Forgejo apply details" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I like about this is not just that it is automated. It is that the automation has shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checks&lt;/li&gt;
&lt;li&gt;plan&lt;/li&gt;
&lt;li&gt;apply&lt;/li&gt;
&lt;li&gt;handoff values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can explain it to another engineer without having to narrate a shell session.&lt;/p&gt;

&lt;p&gt;And because Forgejo itself lives in &lt;code&gt;infrastructure-nixos&lt;/code&gt;, the GitOps path is also part of the same self-hosted story.&lt;/p&gt;

&lt;p&gt;The important detail here is that the runner is not pretending to be stateless.&lt;/p&gt;

&lt;p&gt;The Mac mini is deliberately the trust anchor for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH identities&lt;/li&gt;
&lt;li&gt;Local OCI auth&lt;/li&gt;
&lt;li&gt;Centralized OpenTofu state&lt;/li&gt;
&lt;li&gt;The handoff between infra and runtime repos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did not try to turn that into fake cloud-native purity. I made it explicit and automated around it.&lt;/p&gt;

&lt;p&gt;That tradeoff is important.&lt;/p&gt;

&lt;p&gt;I am not claiming that the Mac mini is an ideal universal model. I am saying that making the trust anchor explicit was better than pretending I had a stateless control plane while quietly depending on a stateful operator machine anyway.&lt;/p&gt;

&lt;p&gt;In practice, that produced a more honest system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runner-local SSH identities&lt;/li&gt;
&lt;li&gt;Runner-local OCI auth&lt;/li&gt;
&lt;li&gt;Centralized local OpenTofu state&lt;/li&gt;
&lt;li&gt;Explicit workflow-to-runtime handoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That honesty made the automation simpler, not weaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-hosted control plane, public mirror
&lt;/h2&gt;

&lt;p&gt;One detail that matters a lot in practice is that I did &lt;strong&gt;not&lt;/strong&gt; want GitHub to become the canonical source of truth just because I wanted public visibility.&lt;/p&gt;

&lt;p&gt;Forgejo, running on the Raspberry Pi through &lt;code&gt;infrastructure-nixos&lt;/code&gt;, remains the primary remote.&lt;/p&gt;

&lt;p&gt;GitHub is a downstream push mirror.&lt;/p&gt;

&lt;p&gt;Architecturally, that distinction matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflows live in the self-hosted control plane&lt;/li&gt;
&lt;li&gt;Operational history lives in the self-hosted control plane&lt;/li&gt;
&lt;li&gt;The mirror gives public visibility without taking control away from the self-hosted path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my setup, that mirror is configured as a native Forgejo push mirror over HTTPS using a fine-grained GitHub token scoped to the destination repository.&lt;/p&gt;

&lt;p&gt;That is a boring detail, but a useful one. It means I get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A self-hosted GitOps path on NixOS&lt;/li&gt;
&lt;li&gt;Public repository visibility on GitHub&lt;/li&gt;
&lt;li&gt;No need to pretend GitHub is the control plane when it is not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, that turned out to be the right compromise between operational ownership and public sharing.&lt;/p&gt;

&lt;p&gt;It also kept the public story aligned with the operational one.&lt;/p&gt;

&lt;p&gt;People can discover the work on GitHub, but the real automation path still begins with a self-hosted Forgejo instance on NixOS. That means I am not maintaining one narrative for public code and another for actual deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard part was not Nix syntax
&lt;/h2&gt;

&lt;p&gt;Nix has a learning curve. That part is real.&lt;/p&gt;

&lt;p&gt;But the hardest part of this project was not writing Nix expressions. The hardest part was &lt;strong&gt;operational sequencing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the part that took the most real engineering:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3c5m333treqem06sqbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3c5m333treqem06sqbo.png" alt="Bootstrap flow" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tricky bits were things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporary OCI bootstrap SSH&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nixos-infect&lt;/code&gt; and bootloader behavior&lt;/li&gt;
&lt;li&gt;First deployment while bootstrap access still exists&lt;/li&gt;
&lt;li&gt;Reboot validation&lt;/li&gt;
&lt;li&gt;Switching to steady-state Tailscale-first operations&lt;/li&gt;
&lt;li&gt;Keeping the repo boundaries clean even while the bootstrap path was still awkward&lt;/li&gt;
&lt;/ul&gt;
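&lt;p&gt;One way to keep that sequencing inside the declared system is to model bootstrap as an ordinary NixOS option, so the transition to steady state is a one-line change in Git rather than a manual step. The option name here is hypothetical; the &lt;code&gt;openssh&lt;/code&gt; and &lt;code&gt;tailscale&lt;/code&gt; options are real NixOS options:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: one flag flips the host from bootstrap access to
# steady-state Tailscale-first access.
{ config, lib, ... }:
{
  options.edge.bootstrapMode = lib.mkOption {
    type = lib.types.bool;
    default = false;
    description = "Keep temporary SSH access open during first deploy.";
  };

  config = {
    # Permissive root SSH only while bootstrapping; key-only afterwards.
    services.openssh.settings.PermitRootLogin =
      if config.edge.bootstrapMode then "yes" else "prohibit-password";

    # Steady-state access plane.
    services.tailscale.enable = true;
  };
}
&lt;/code&gt;&lt;/pre&gt;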

&lt;p&gt;That is where the real learning was.&lt;/p&gt;

&lt;p&gt;The host was not "done" when the config evaluated. The host was done when the recovery and access model made sense after reboot.&lt;/p&gt;

&lt;p&gt;That sequencing point is worth emphasizing because it is easy to miss in declarative systems. A config can be correct but still not operationally safe if the transition order is wrong.&lt;/p&gt;

&lt;p&gt;That is the sort of problem that shows up all the time in real infrastructure work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The final state is reasonable&lt;/li&gt;
&lt;li&gt;The transition path is what bites you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For edge systems, those transition paths matter even more because recovery access, bootstrap access, and steady-state access are often different systems.&lt;/p&gt;

&lt;p&gt;That is also why I no longer think of bootstrap as a small implementation detail. It is part of the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proof that the workload really landed
&lt;/h2&gt;

&lt;p&gt;I wanted the talk and this post to prove real runtime state, not just repo beauty.&lt;/p&gt;

&lt;p&gt;Inside the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40imner80uaip7r76qx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40imner80uaip7r76qx2.png" alt="RustDesk containers in k9s" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That screenshot is important because it shows the two expected RustDesk server containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hbbs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hbbr&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ties directly back to the declared workload.&lt;/p&gt;

&lt;p&gt;And from the user side:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qtwrxzy64xtmzj7bfzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qtwrxzy64xtmzj7bfzb.png" alt="RustDesk client sees target device" width="800" height="1050"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jaulc2ex9uwues7heff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jaulc2ex9uwues7heff.png" alt="RustDesk remote session live" width="357" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the level of proof I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Declared in code&lt;/li&gt;
&lt;li&gt;Deployed through the workflow&lt;/li&gt;
&lt;li&gt;Visible in the cluster&lt;/li&gt;
&lt;li&gt;Usable by a real client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, that combination mattered more than a polished demo. It tied together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo intent&lt;/li&gt;
&lt;li&gt;Deploy path&lt;/li&gt;
&lt;li&gt;Cluster state&lt;/li&gt;
&lt;li&gt;User-visible result&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Backups are layered, not singular
&lt;/h2&gt;

&lt;p&gt;One thing I do not want to undersell is that "backups" here are not one checkbox.&lt;/p&gt;

&lt;p&gt;The platform ended up with multiple backup types because each layer has a different failure mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forgejo control plane backups
&lt;/h3&gt;

&lt;p&gt;On the Raspberry Pi side, &lt;code&gt;infrastructure-nixos&lt;/code&gt; uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restic -&amp;gt; BorgBase&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;daily append-only backups for Forgejo repositories, custom files, and database snapshots&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Rclone -&amp;gt; pCloud&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;weekly backups of Forgejo LFS objects&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;That split is intentional.&lt;/p&gt;

&lt;p&gt;The Forgejo host is both the Git control plane and the workflow surface, so protecting it means protecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repository data&lt;/li&gt;
&lt;li&gt;The database state&lt;/li&gt;
&lt;li&gt;Large file storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treating those as a single undifferentiated blob would have been simpler on paper, but worse in practice.&lt;/p&gt;
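&lt;p&gt;Expressed as a NixOS module, that split can be sketched roughly like this. This is a minimal illustration, not the actual module from &lt;code&gt;infrastructure-nixos&lt;/code&gt;: the repository URL, filesystem paths, and secret names are placeholders, and it assumes sops-nix for credentials (BorgBase enforces append-only on the server side).&lt;/p&gt;

```nix
{ config, pkgs, ... }:
{
  # Daily Restic backups to BorgBase (illustrative values throughout).
  services.restic.backups.forgejo = {
    repository = "rest:https://example.repo.borgbase.com";
    passwordFile = config.sops.secrets."restic/password".path;
    paths = [
      "/var/lib/forgejo/repositories"
      "/var/lib/forgejo/custom"
      "/var/backup/forgejo-db.sql"  # database snapshot produced by a pre-backup dump
    ];
    timerConfig = { OnCalendar = "daily"; Persistent = true; };
  };

  # Weekly Rclone push of Forgejo LFS objects to pCloud.
  systemd.services.forgejo-lfs-backup = {
    serviceConfig.Type = "oneshot";
    path = [ pkgs.rclone ];
    script = "rclone sync /var/lib/forgejo/data/lfs pcloud:forgejo-lfs";
  };
  systemd.timers.forgejo-lfs-backup = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "weekly";
  };
}
```

&lt;p&gt;The point of the sketch is the shape: two different tools, two different schedules, two different failure modes, all declared in one place.&lt;/p&gt;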

&lt;h3&gt;
  
  
  Edge runtime backups
&lt;/h3&gt;

&lt;p&gt;On the Oracle edge node side, &lt;code&gt;nix-k3s-edge-cluster&lt;/code&gt; reuses the shared Restic credentials from &lt;code&gt;infrastructure-secrets&lt;/code&gt; and backs up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RustDesk runtime state&lt;/li&gt;
&lt;li&gt;the K3s server token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is deliberately narrower than "back up the whole cluster filesystem."&lt;/p&gt;

&lt;p&gt;The current goal is to protect the runtime state that matters for rebuild and access continuity, without pretending I have a fully solved embedded-etcd disaster-recovery story yet.&lt;/p&gt;
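&lt;p&gt;The narrower edge-side job might look something like the following sketch. The RustDesk state path is a guess, the repository URL is a placeholder, and the password file is assumed to come from the shared &lt;code&gt;infrastructure-secrets&lt;/code&gt; sops material; only the K3s token path is the standard server location.&lt;/p&gt;

```nix
{ config, ... }:
{
  # Deliberately narrow scope: only the state needed for rebuild and access continuity.
  services.restic.backups.edge = {
    repository = "rest:https://example.repo.borgbase.com";
    passwordFile = config.sops.secrets."restic/password".path;  # shared credentials
    paths = [
      "/var/lib/rustdesk"                  # RustDesk runtime state (illustrative path)
      "/var/lib/rancher/k3s/server/token"  # K3s server token
    ];
    timerConfig.OnCalendar = "daily";
  };
}
```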

&lt;h3&gt;
  
  
  Infrastructure state backups
&lt;/h3&gt;

&lt;p&gt;The infra layer has a different recovery model again.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;edge-cluster-infra&lt;/code&gt; creates timestamped OpenTofu state archives, writes a checksum alongside them, and uploads them to pCloud before running apply operations that change the saved state.&lt;/p&gt;

&lt;p&gt;That means the state for Oracle provisioning is also recoverable without turning the Git repo into a fake state backend.&lt;/p&gt;
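&lt;p&gt;That archive-and-checksum step could be packaged as a small helper, sketched here as a Nix-wrapped script. The remote name, destination folder, and state filename are illustrative, not the real values from the repo.&lt;/p&gt;

```nix
{ pkgs, ... }:
{
  # Hypothetical helper: archive OpenTofu state with a checksum before any apply.
  environment.systemPackages = [
    (pkgs.writeShellApplication {
      name = "tofu-state-backup";
      runtimeInputs = [ pkgs.rclone pkgs.gnutar pkgs.coreutils ];
      text = ''
        ts=$(date +%Y%m%d-%H%M%S)
        tar -czf "tofu-state-$ts.tar.gz" terraform.tfstate
        sha256sum "tofu-state-$ts.tar.gz" > "tofu-state-$ts.tar.gz.sha256"
        rclone copy "tofu-state-$ts.tar.gz" pcloud:tofu-state/
        rclone copy "tofu-state-$ts.tar.gz.sha256" pcloud:tofu-state/
      '';
    })
  ];
}
```

&lt;p&gt;The checksum matters as much as the archive: it turns "the upload probably worked" into something a restore path can verify.&lt;/p&gt;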

&lt;p&gt;So the backup model is really three different recovery contracts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forgejo control-plane backups&lt;/li&gt;
&lt;li&gt;Runtime/workload backups&lt;/li&gt;
&lt;li&gt;Infrastructure state backups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is closer to how the system actually behaves than saying "I have backups."&lt;/p&gt;

&lt;p&gt;More importantly, those layers are recoverable in different ways for different reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control-plane recovery protects the Git and workflow surface&lt;/li&gt;
&lt;li&gt;Runtime recovery protects workload continuity and host-side state&lt;/li&gt;
&lt;li&gt;Infra-state recovery protects the declarative record of what exists in Oracle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a much healthier model than hoping one backup mechanism will magically cover every layer.&lt;/p&gt;

&lt;p&gt;If there is one pattern I would reuse immediately in another platform, it is this one: separate the backup story by failure domain, then make each one operational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recovery is part of the story
&lt;/h2&gt;

&lt;p&gt;A reproducible system that cannot prove its backup readiness remains fragile.&lt;/p&gt;

&lt;p&gt;That became another important boundary for the project: backup and restore could not remain undocumented "future work."&lt;/p&gt;

&lt;p&gt;This is the recovery proof I ended up using:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2bkoowssdh36smq5t4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2bkoowssdh36smq5t4s.png" alt="Backup validate workflow" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup readiness is visible in automation&lt;/li&gt;
&lt;li&gt;Repository access is validated before the failure day&lt;/li&gt;
&lt;li&gt;Restore confidence is operational, not tribal knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the point where the system starts to feel trustworthy.&lt;/p&gt;
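&lt;p&gt;A hedged sketch of what such a scheduled validation can look like as a NixOS unit. The repository URL and secret name are assumptions; the restic flags are real and do the actual work of proving access and sampling pack data.&lt;/p&gt;

```nix
{ config, pkgs, ... }:
{
  # Periodic proof that the repository is reachable and internally consistent.
  systemd.services.restic-validate = {
    serviceConfig.Type = "oneshot";
    path = [ pkgs.restic ];
    environment.RESTIC_REPOSITORY = "rest:https://example.repo.borgbase.com";
    script = ''
      export RESTIC_PASSWORD_FILE=${config.sops.secrets."restic/password".path}
      restic snapshots --latest 1         # proves access and recency
      restic check --read-data-subset=1%  # samples real pack data, not just metadata
    '';
  };
  systemd.timers.restic-validate = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "weekly";
  };
}
```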

&lt;p&gt;This is another place where reusing the proven patterns from &lt;code&gt;infrastructure-nixos&lt;/code&gt; helped. The edge node did not need a completely different recovery philosophy. It needed the same discipline carried into a new runtime.&lt;/p&gt;

&lt;p&gt;That is also why I kept backup and restore in the runtime repo instead of leaving them as external runbook folklore. If the runtime contract matters, the recovery contract matters too.&lt;/p&gt;

&lt;p&gt;That lesson came directly from the Forgejo Pi work.&lt;/p&gt;

&lt;p&gt;The control-plane host was not "done" when backups existed. It was only done once backup validation, restore checks, and a real restore path had been exercised end-to-end. In the Forgejo case, that meant testing the restore flow on spare media, not just trusting that the timer had run.&lt;/p&gt;

&lt;p&gt;That distinction changed how I treated the Oracle edge node as well. Backup configuration was not enough. I wanted recovery to be visible, testable, and boring.&lt;/p&gt;

&lt;p&gt;From an architectural perspective, backup validation became another declarative boundary.&lt;/p&gt;

&lt;p&gt;It was not enough to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The host is declared&lt;/li&gt;
&lt;li&gt;The cluster is declared&lt;/li&gt;
&lt;li&gt;The workload is declared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also wanted to be able to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup readiness is testable&lt;/li&gt;
&lt;li&gt;Restore prerequisites are testable&lt;/li&gt;
&lt;li&gt;Recovery is not a purely manual ritual&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a different reliability bar than "we probably have a backup."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;If I keep evolving this design, the next things I want to deepen are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More explicit restore rehearsal for the K3s control-plane side&lt;/li&gt;
&lt;li&gt;Broader workload examples beyond RustDesk&lt;/li&gt;
&lt;li&gt;Clearer multi-node stories using the same runtime model&lt;/li&gt;
&lt;li&gt;Continued reduction of bootstrap awkwardness at the host boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current system is already useful, but it also makes the next engineering questions very visible, which I see as a good sign.&lt;/p&gt;

&lt;p&gt;Reproducibility is not real until recovery is boring.&lt;/p&gt;

&lt;p&gt;That sentence sounds simple, but it changed the way I evaluated the whole system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond my own lab
&lt;/h2&gt;

&lt;p&gt;I do not find this interesting just because it worked on my hardware.&lt;/p&gt;

&lt;p&gt;I find it interesting because the pattern generalizes to environments where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes are remote or operationally awkward&lt;/li&gt;
&lt;li&gt;Rebuilding confidence is more important than convenience&lt;/li&gt;
&lt;li&gt;Infrastructure teams need tighter control over host drift&lt;/li&gt;
&lt;li&gt;Kubernetes alone is not enough to describe the real system boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That includes a lot of practical scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remote edge nodes&lt;/li&gt;
&lt;li&gt;air-gapped or semi-connected environments&lt;/li&gt;
&lt;li&gt;regulated systems where configuration provenance matters&lt;/li&gt;
&lt;li&gt;smaller platform teams that need deterministic rebuilds without building a giant internal platform product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation here is specific. The architectural lesson is broader.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this approach makes sense
&lt;/h2&gt;

&lt;p&gt;I do not think NixOS + K3s is a universal answer.&lt;/p&gt;

&lt;p&gt;I think it is a strong fit when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes are remote&lt;/li&gt;
&lt;li&gt;Rebuilding confidence matters&lt;/li&gt;
&lt;li&gt;You want host + cluster + workload under one model&lt;/li&gt;
&lt;li&gt;Multi-arch or odd hardware is part of the story&lt;/li&gt;
&lt;li&gt;Operational consistency matters more than lowest-friction onboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think it is a weaker fit when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your team only needs app-level GitOps&lt;/li&gt;
&lt;li&gt;The node itself is disposable&lt;/li&gt;
&lt;li&gt;You need the lowest-complexity path for a broad team quickly&lt;/li&gt;
&lt;li&gt;Nix would become the main project instead of supporting the real project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the tradeoff I would be most honest about with any team considering this approach: the determinism is real, but so is the complexity budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I find most interesting
&lt;/h2&gt;

&lt;p&gt;The most interesting thing here is not Nix by itself.&lt;/p&gt;

&lt;p&gt;It is moving the source of truth &lt;strong&gt;down to the node boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you can declare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The host&lt;/li&gt;
&lt;li&gt;The cluster&lt;/li&gt;
&lt;li&gt;The workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a single coherent system, you get a different kind of reliability than you do from container reproducibility alone.&lt;/p&gt;

&lt;p&gt;That, to me, is the real promise of this model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repos
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ces0712/infrastructure-nixos" rel="noopener noreferrer"&gt;infrastructure-nixos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ces0712/edge-cluster-infra" rel="noopener noreferrer"&gt;edge-cluster-infra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ces0712/infrastructure-secrets" rel="noopener noreferrer"&gt;infrastructure-secrets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ces0712/nix-k3s-edge-cluster" rel="noopener noreferrer"&gt;nix-k3s-edge-cluster&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If there is interest, I may publish a follow-up focused just on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The bootstrap/reboot sequence&lt;/li&gt;
&lt;li&gt;How I split infra vs runtime repos&lt;/li&gt;
&lt;li&gt;The backup and restore workflow&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nixos</category>
      <category>k3s</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
