<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SEON</title>
    <description>The latest articles on DEV Community by SEON (@seon).</description>
    <link>https://dev.to/seon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963614%2F9312fece-b0a1-4cdf-9adb-62fd1f21f29e.jpg</url>
      <title>DEV Community: SEON</title>
      <link>https://dev.to/seon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seon"/>
    <language>en</language>
    <item>
      <title>Hands-on DevOps #1 — GitLab CI/CD Components &amp; Catalog: Build, Publish, and Consume by Version</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:26:49 +0000</pubDate>
      <link>https://dev.to/seon/hands-on-devops-1-gitlab-cicd-components-catalog-build-publish-and-consume-by-version-3i14</link>
      <guid>https://dev.to/seon/hands-on-devops-1-gitlab-cicd-components-catalog-build-publish-and-consume-by-version-3i14</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Put a component under &lt;code&gt;templates/&lt;/code&gt; and declare its inputs with &lt;code&gt;spec:inputs&lt;/code&gt; (types, defaults, &lt;code&gt;options&lt;/code&gt;, even regex). Invalid values are rejected before the pipeline is even created.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish &amp;amp; consume&lt;/strong&gt;: Push a semantic-version tag and the &lt;code&gt;release&lt;/code&gt; job publishes the component to the &lt;strong&gt;CI/CD Catalog&lt;/strong&gt;; other projects pull it in with &lt;code&gt;include: component@version&lt;/code&gt;. A partial version like &lt;code&gt;@1&lt;/code&gt; accepts only updates within that major version, while &lt;code&gt;@~latest&lt;/code&gt; always tracks the newest release, breaking changes included.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it's verified&lt;/strong&gt;: Every output in this article was captured by running directly against a real gitlab.com project, &lt;a href="https://gitlab.com/SEON.N/gitlab-ci-components-catalog" rel="noopener noreferrer"&gt;SEON.N/gitlab-ci-components-catalog&lt;/a&gt; (public), via pipelines, releases, and the CI Lint API.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When first setting up a CI/CD pipeline, most people start by copying another project's &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;. That works at first, but as projects multiply, the same configuration gets duplicated everywhere and three problems keep recurring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discoverability&lt;/strong&gt; — There's no way to know whether someone has already built the same build/test/deploy job. So every team rewrites a similar pipeline from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability&lt;/strong&gt; — You can pull in another file with &lt;code&gt;include&lt;/code&gt;, but there's no versioning and no input validation. When the source file changes, the pipelines referencing it break without warning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribution&lt;/strong&gt; — There's no standard path to safely distribute a well-built pipeline piece across the organization and announce "here it is, go use it."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CI/CD components and the catalog exist to solve exactly these three. In one line: they turn pipeline configuration from "code you copy-paste" into "a versioned package."&lt;/p&gt;

&lt;p&gt;Just as you don't copy a library's source wholesale but pull it in by version with &lt;code&gt;npm install&lt;/code&gt; or a Go module, components let you pull pipeline pieces in by version, like a dependency. And the CI/CD Catalog is the marketplace that gathers those components in one place so you can search and discover them.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;CI/CD component&lt;/strong&gt; is what GitLab defines as a "reusable single pipeline configuration unit." You pull it in with &lt;code&gt;include&lt;/code&gt; just like before, but two things are fundamentally different.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Typed inputs (&lt;code&gt;spec:inputs&lt;/code&gt;)&lt;/strong&gt;: A component can declare &lt;code&gt;string&lt;/code&gt;/&lt;code&gt;number&lt;/code&gt;/&lt;code&gt;boolean&lt;/code&gt;/&lt;code&gt;array&lt;/code&gt; types, defaults, &lt;code&gt;options&lt;/code&gt; (enum), and &lt;code&gt;regex&lt;/code&gt; validation. Invalid values are rejected before the pipeline is even created.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic versioning + catalog&lt;/strong&gt;: Components are released with semantic versions and discovered in the &lt;strong&gt;CI/CD Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Compared with the existing &lt;code&gt;include&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;What it pulls in&lt;/th&gt;
&lt;th&gt;Typed inputs&lt;/th&gt;
&lt;th&gt;Version/Catalog&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;include:local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;a file in the same repo&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;include:project&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;a file from another project (ref)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;include:remote&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;a file at a URL&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;include:template&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;a GitLab-provided template&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;include:component&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a component (spec + job)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (Catalog)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In one line: where &lt;code&gt;include:local&lt;/code&gt;/&lt;code&gt;include:project&lt;/code&gt;/&lt;code&gt;include:remote&lt;/code&gt; "copy a file as-is," &lt;code&gt;include:component&lt;/code&gt; "pulls in a dependency with an explicit version and input contract (&lt;code&gt;spec&lt;/code&gt;)." The key difference is that you're pulling in not a plain file but a building block with a guaranteed input spec and version.&lt;/p&gt;

&lt;p&gt;CI/CD components and the CI/CD Catalog reached &lt;strong&gt;GA in GitLab 17.0 (2024-05-16)&lt;/strong&gt;; before that they were experimental/beta. This article uses GitLab.com (always the latest version).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it's useful
&lt;/h2&gt;

&lt;p&gt;Components shine when "you repeat the same thing across many projects." Common real-world use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;In this situation&lt;/th&gt;
&lt;th&gt;Solve it with a component&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apply the same security scan or lint to every repository&lt;/td&gt;
&lt;td&gt;Build a scan component once and put it in the catalog; each project gets the same checks with a few lines of &lt;code&gt;include&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standardize a deploy routine (Cloud Run, Kubernetes, etc.)&lt;/td&gt;
&lt;td&gt;Take the environment name and image tag as &lt;code&gt;inputs&lt;/code&gt;, and reuse the same component across teams by changing only the values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A platform team wants to enforce a company-standard pipeline&lt;/td&gt;
&lt;td&gt;Gather components at the group level and offer them as a catalog — you get governance (enforcement) and DRY (de-duplication) at the same time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control breaking changes, like a library upgrade&lt;/td&gt;
&lt;td&gt;Consumers pin a partial version like &lt;code&gt;@1&lt;/code&gt; to auto-accept only non-breaking updates, or pin an exact version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The flagship example GitLab cited at GA was a Google Cloud Run deploy component, and the most common adoption driver is turning "jobs every team repeats the same way" — security scan, build, lint, deploy — into shared, organization-wide building blocks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version history and direction
&lt;/h2&gt;

&lt;p&gt;Components and the catalog haven't stood still since they went GA in 17.0. GitLab keeps adding to this area in its monthly releases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;What was added/changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Beta (2023-12)&lt;/td&gt;
&lt;td&gt;CI/CD Catalog beta released&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17.0 (2024-05)&lt;/td&gt;
&lt;td&gt;Components + catalog reach full GA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18.0&lt;/td&gt;
&lt;td&gt;The &lt;code&gt;release&lt;/code&gt; job's standard image moved from &lt;code&gt;release-cli&lt;/code&gt; to &lt;code&gt;glab&lt;/code&gt;; &lt;code&gt;release-cli&lt;/code&gt; is deprecated, removal planned for 20.0 (until then it falls back automatically when glab is absent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18.5&lt;/td&gt;
&lt;td&gt;Per-project component limit raised 30 → 100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18.6–18.7&lt;/td&gt;
&lt;td&gt;Component context expression — a component can access its own metadata such as name and version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18.9&lt;/td&gt;
&lt;td&gt;Catalog resource usage analytics introduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19.0 (2026-05)&lt;/td&gt;
&lt;td&gt;Detailed component usage for maintainers — track which project uses which version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The direction is clear: the core syntax (&lt;code&gt;spec:inputs&lt;/code&gt;, semantic versioning, &lt;code&gt;release&lt;/code&gt;, the catalog) has stayed stable since GA, while operational conveniences (usage analytics, usage detail, context expressions) and raised limits keep being layered on top.&lt;/p&gt;

&lt;p&gt;So it's a stable foundation you learn once and use for a long time, and at the same time an area GitLab keeps expanding with each release. This article's hands-on was run on gitlab.com (always latest), so the captured behavior reflects the newest version at the time of writing.&lt;/p&gt;

&lt;p&gt;The availability range is broad too. Components, the catalog, and &lt;code&gt;spec:inputs&lt;/code&gt; all work across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tier: Free · Premium · Ultimate — all tiers&lt;/li&gt;
&lt;li&gt;Offering: GitLab.com (SaaS) · GitLab Self-Managed · GitLab Dedicated — all offerings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even on the free tier or self-managed, the core actions — build, publish, and consume components — work as-is. Just remember two things. First, 19.0's "detailed component usage" is Premium-and-up only. Second, a self-managed instance's catalog starts with zero published components, so you fill it by publishing your own or mirroring from GitLab.com, and the instance must be on 17.0 or later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38mvs3hy4vvc0wqrffe7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38mvs3hy4vvc0wqrffe7.png" alt="ecosystem" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The component project (&lt;code&gt;templates/*.yml&lt;/code&gt; + &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; + README/LICENSE) publishes versions to the catalog via release, and any consumer project pulls them in with &lt;code&gt;include: component@version&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lifecycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2buq5rsoluc8wgvpddq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2buq5rsoluc8wgvpddq.png" alt="lifecycle" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(1) Write the component and push to main → (2) the self-test pipeline actually runs the component → (3) tag a semantic version → (4) the &lt;code&gt;release&lt;/code&gt; job publishes → (5) the version is registered in the catalog → (6) consumers include it. Because self-test runs first even on the tag pipeline, a broken component never gets published.&lt;/p&gt;

&lt;h2&gt;
  
  
  include resolution and input validation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ujd7wqtzueelo2mi81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ujd7wqtzueelo2mi81.png" alt="resolution" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;include: component@version&lt;/code&gt; is resolved at &lt;strong&gt;pipeline creation time&lt;/strong&gt;: version resolution (tag/SHA/partial/~latest) → fetch the template → input validation (type/options/regex) → &lt;code&gt;$[[ inputs.x ]]&lt;/code&gt; interpolation → merge the job. If it's blocked at validation, the pipeline fails before a runner ever spins up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This hands-on uses only local &lt;code&gt;glab&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;, &lt;code&gt;curl&lt;/code&gt;, and &lt;code&gt;python3&lt;/code&gt;. If &lt;code&gt;glab&lt;/code&gt; is already installed and authenticated, use it as is; otherwise install it per the table below, or run it as a container with &lt;code&gt;docker run&lt;/code&gt; from the &lt;code&gt;registry.gitlab.com/gitlab-org/cli&lt;/code&gt; image.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;macOS&lt;/th&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;th&gt;Linux&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitLab&lt;/td&gt;
&lt;td&gt;17.0+&lt;/td&gt;
&lt;td&gt;— (SaaS)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;glab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.40+&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brew install glab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;winget install glab.glab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;package manager, or &lt;code&gt;docker run --rm -it registry.gitlab.com/gitlab-org/cli:latest&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.30+&lt;/td&gt;
&lt;td&gt;preinstalled&lt;/td&gt;
&lt;td&gt;preinstalled / &lt;code&gt;winget install Git.Git&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;preinstalled / distro package&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7.x+&lt;/td&gt;
&lt;td&gt;preinstalled&lt;/td&gt;
&lt;td&gt;preinstalled&lt;/td&gt;
&lt;td&gt;preinstalled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;a PAT with &lt;code&gt;api&lt;/code&gt; + &lt;code&gt;write_repository&lt;/code&gt; scope, or &lt;code&gt;glab auth login&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runner&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;enable a shared/group runner on the project (to run pipelines)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;glab&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;, and &lt;code&gt;curl&lt;/code&gt; take the same commands on macOS, Windows, and Linux, so the commands below apply to all three; any OS-specific differences are noted separately.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Core concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Component directory structure
&lt;/h3&gt;

&lt;p&gt;A component project places components under a top-level &lt;code&gt;templates/&lt;/code&gt; directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── templates/
│   ├── greeting.yml          # single-file component
│   └── my-other/             # directory-form component
│       └── template.yml      # only this file is published
├── LICENSE.md
├── README.md
└── .gitlab-ci.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single file is &lt;code&gt;templates/&amp;lt;name&amp;gt;.yml&lt;/code&gt;; a more complex component is &lt;code&gt;templates/&amp;lt;name&amp;gt;/template.yml&lt;/code&gt;. In the directory form, only &lt;code&gt;template.yml&lt;/code&gt; is published — the rest (build/test helpers) are not.&lt;/p&gt;
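&lt;p&gt;To make the naming rule concrete, here is a minimal Python sketch (python3 is already among this hands-on's tools) that maps a component name to the two paths where its file may live. The helper name &lt;code&gt;candidate_template_paths&lt;/code&gt; is made up for illustration, and the order of the two candidates is arbitrary, not a statement about how GitLab looks them up.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def candidate_template_paths(component_name):
    """Return the two repository paths where a component may be defined."""
    root = PurePosixPath("templates")
    return [
        str(root / (component_name + ".yml")),       # single-file form
        str(root / component_name / "template.yml"),  # directory form
    ]


for path in candidate_template_paths("greeting"):
    print(path)
```

&lt;p&gt;Either form is referenced the same way by consumers: only the component name appears in the &lt;code&gt;include&lt;/code&gt; path, never the file name.&lt;/p&gt;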

&lt;h3&gt;
  
  
  2) &lt;code&gt;spec:inputs&lt;/code&gt; — typed inputs
&lt;/h3&gt;

&lt;p&gt;A component file is split into two YAML documents. Above &lt;code&gt;---&lt;/code&gt; is the &lt;code&gt;spec&lt;/code&gt; (input declarations); below it is the actual job definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;      &lt;span class="c1"&gt;# string (default) / number / boolean / array&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;      &lt;span class="c1"&gt;# with a default it's optional; without one it's required&lt;/span&gt;
    &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;plain&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;banner&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# allowed-value whitelist (enum)&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;^v?\d+\.\d+\.\d+$&lt;/span&gt;    &lt;span class="c1"&gt;# regex validation (RE2)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# interpolate with $[[ inputs.NAME ]] in the job definition&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters: input validation happens at &lt;strong&gt;pipeline creation time&lt;/strong&gt; (when the configuration is fetched), so invalid input is rejected before a runner ever spins up, saving cost and time. A pipeline can take up to 20 inputs.&lt;/p&gt;
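&lt;p&gt;The checks themselves are easy to picture. The sketch below is a rough Python model of them, not GitLab's implementation: the &lt;code&gt;SPEC&lt;/code&gt; dict layout and the &lt;code&gt;validate_inputs&lt;/code&gt; helper are assumptions made for illustration, and Python's &lt;code&gt;re&lt;/code&gt; module stands in for RE2, which behaves the same for this simple pattern.&lt;/p&gt;

```python
import re

# Hypothetical spec mirroring the YAML above; the dict layout is an assumption.
SPEC = {
    "stage": {"type": "string", "default": "test"},
    "style": {"options": ["plain", "banner"]},
    "version": {"regex": r"^v?\d+\.\d+\.\d+$"},
}

PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list}


def validate_inputs(spec, given):
    """Check `given` against `spec`; return (resolved_inputs, errors)."""
    resolved = {}
    errors = [f"unknown input: {name}" for name in given if name not in spec]
    for name, rule in spec.items():
        kind = rule.get("type", "string")  # string is the default type
        if name in given:
            value = given[name]
        elif "default" in rule:
            value = rule["default"]
        else:
            # No value and no default means the input is required.
            errors.append(f"{name}: required input is missing")
            continue
        # bool is a subclass of int in Python, so exclude it from "number".
        type_ok = isinstance(value, PYTHON_TYPES[kind]) and not (
            kind == "number" and isinstance(value, bool)
        )
        if not type_ok:
            errors.append(f"{name}: expected {kind}")
        elif "options" in rule and value not in rule["options"]:
            errors.append(f"{name}: {value!r} is not an allowed option")
        elif kind == "string" and "regex" in rule and not re.search(rule["regex"], value):
            errors.append(f"{name}: {value!r} does not match {rule['regex']}")
        else:
            resolved[name] = value
    return resolved, errors


print(validate_inputs(SPEC, {"style": "banner", "version": "v1.2.3"}))
print(validate_inputs(SPEC, {"style": "fancy"}))
```

&lt;p&gt;The second call fails twice: &lt;code&gt;style&lt;/code&gt; is outside its options and &lt;code&gt;version&lt;/code&gt; has no default, so it is required. In GitLab, either error alone would stop the pipeline from being created.&lt;/p&gt;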

&lt;h3&gt;
  
  
  3) Interpolation &lt;code&gt;$[[ inputs.x ]]&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Unlike the CI variable &lt;code&gt;$VAR&lt;/code&gt;, input interpolation uses the &lt;code&gt;$[[ inputs.name ]]&lt;/code&gt; syntax. It works in the job &lt;strong&gt;name&lt;/strong&gt;, in &lt;strong&gt;scripts&lt;/strong&gt;, and on array elements &lt;code&gt;$[[ inputs.arr[0] ]]&lt;/code&gt;. Interpolation is evaluated once at config-fetch time and stays fixed for the whole pipeline.&lt;/p&gt;
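&lt;p&gt;A small Python sketch shows the idea of this substitution pass. It is an approximation under stated assumptions: the &lt;code&gt;interpolate&lt;/code&gt; helper and its regular expression are illustrative, it handles only plain &lt;code&gt;inputs.NAME&lt;/code&gt; references (no array indexing), and it treats the configuration as text, which is not necessarily how GitLab does it internally.&lt;/p&gt;

```python
import re

# Matches $[[ inputs.NAME ]] with optional spaces inside the brackets.
INPUT_REF = re.compile(r"\$\[\[\s*inputs\.([A-Za-z0-9_-]+)\s*\]\]")


def interpolate(text, inputs):
    """Replace every $[[ inputs.NAME ]] in `text` with the input's value."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"undefined input: {name}")
        value = inputs[name]
        # Render booleans the YAML way (true/false), not Python's True/False.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return INPUT_REF.sub(substitute, text)


print(interpolate("stage: $[[ inputs.stage ]]", {"stage": "test"}))
print(interpolate("GREET_SHOUT: $[[ inputs.shout ]]", {"shout": True}))
```

&lt;p&gt;Note that &lt;code&gt;$VAR&lt;/code&gt; references pass through untouched: they are CI variables, expanded later when the job runs, whereas inputs are already fixed text by then.&lt;/p&gt;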

&lt;h3&gt;
  
  
  4) Version references
&lt;/h3&gt;

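&lt;p&gt;The part after &lt;code&gt;@&lt;/code&gt; can take several forms. When it could match more than one of them, GitLab resolves it in this priority order:&lt;/p&gt;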

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Commit SHA&lt;/strong&gt; — &lt;code&gt;@e3262fdd...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag&lt;/strong&gt; — &lt;code&gt;@v1.0.0&lt;/code&gt; (catalog publishing requires a semantic-version tag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch&lt;/strong&gt; — &lt;code&gt;@main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial version / latest&lt;/strong&gt; — &lt;code&gt;@1.2&lt;/code&gt;, &lt;code&gt;@1&lt;/code&gt;, &lt;code&gt;@~latest&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;~latest&lt;/code&gt; points to the latest released version (excluding pre-releases), so breaking changes can flow in automatically. In production, prefer a pinned version or a partial version like &lt;code&gt;@1&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
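&lt;p&gt;To see how partial versions and &lt;code&gt;~latest&lt;/code&gt; differ, here is a minimal Python sketch of the selection rule. The &lt;code&gt;resolve&lt;/code&gt; and &lt;code&gt;parse&lt;/code&gt; helpers and the list of released tags are hypothetical; the sketch covers only partial versions and &lt;code&gt;~latest&lt;/code&gt; (not SHAs, full tags, or branches) and assumes that an optional &lt;code&gt;v&lt;/code&gt; prefix is allowed.&lt;/p&gt;

```python
import re

SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z.-]+)?$")


def parse(tag):
    """Return ((major, minor, patch), is_prerelease), or None if not semver."""
    m = SEMVER.match(tag)
    if m is None:
        return None
    numbers = (int(m.group(1)), int(m.group(2)), int(m.group(3)))
    return numbers, m.group(4) is not None


def resolve(ref, released_tags):
    """Resolve '~latest', '1', or '1.2' to the newest matching stable tag."""
    stable = []
    for tag in released_tags:
        parsed = parse(tag)
        if parsed is not None and not parsed[1]:  # skip pre-releases
            stable.append((parsed[0], tag))
    if ref != "~latest":
        parts = ref.split(".")
        if not all(part.isdigit() for part in parts):
            return None  # not a partial version; out of scope for this sketch
        prefix = tuple(int(part) for part in parts)
        stable = [item for item in stable if item[0][: len(prefix)] == prefix]
    if not stable:
        return None
    return max(stable)[1]


tags = ["1.0.0", "1.1.0", "1.2.0", "2.0.0", "2.1.0-rc1"]  # hypothetical releases
for ref in ["1", "1.1", "~latest"]:
    print(ref, resolve(ref, tags))
```

&lt;p&gt;With these tags, &lt;code&gt;@1&lt;/code&gt; stays on the 1.x line even after 2.0.0 ships, while &lt;code&gt;@~latest&lt;/code&gt; jumps to 2.0.0 as soon as it is released. That jump is exactly the risk the note above warns about.&lt;/p&gt;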

&lt;h3&gt;
  
  
  5) include path and &lt;code&gt;release&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/&amp;lt;project-path&amp;gt;/&amp;lt;component-name&amp;gt;@&amp;lt;version&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;$CI_SERVER_FQDN&lt;/code&gt; is a predefined variable for the GitLab host FQDN, so the same config works across instances. And for a version to appear in the catalog, you must create the release with the &lt;strong&gt;&lt;code&gt;release&lt;/code&gt; keyword&lt;/strong&gt; (not the Releases API).&lt;/p&gt;
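&lt;p&gt;The anatomy of that path is easiest to see by taking one apart. The Python sketch below splits a reference into host, project path, component name, and version; &lt;code&gt;parse_component_ref&lt;/code&gt; is an illustrative helper, and the example assumes &lt;code&gt;$CI_SERVER_FQDN&lt;/code&gt; has already expanded to &lt;code&gt;gitlab.com&lt;/code&gt;.&lt;/p&gt;

```python
def parse_component_ref(ref):
    """Split 'host/project/path/component@version' into its four parts."""
    address, sep, version = ref.rpartition("@")
    if not sep or not version:
        raise ValueError("component reference needs an @version suffix")
    parts = address.split("/")
    # At least three segments are needed: host, project path, component name.
    if len(parts) in (1, 2):
        raise ValueError("expected host/project-path/component-name")
    return {
        "host": parts[0],
        "project": "/".join(parts[1:-1]),  # may contain nested groups
        "component": parts[-1],
        "version": version,
    }


print(parse_component_ref(
    "gitlab.com/SEON.N/gitlab-ci-components-catalog/greeting@1.0.0"
))
```

&lt;p&gt;The last path segment is always the component name, so everything between the host and that segment is the project path, however deeply its groups are nested.&lt;/p&gt;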

&lt;h2&gt;
  
  
  Hands-on steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Create the project and clone
&lt;/h3&gt;

&lt;p&gt;This step creates the empty project (the container) that will hold the components. To publish to the catalog, a project needs a description, a README, and components under &lt;code&gt;templates/&lt;/code&gt;, so we prepare that frame first. There are no components yet — we just take the empty repo and set up the working directory.&lt;/p&gt;

&lt;p&gt;We created it with a description via the API for clarity (the catalog requires a description). &lt;code&gt;glab repo create&lt;/code&gt; works too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the token from glab config, or export a PAT directly&lt;/span&gt;
&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;glab config get token &lt;span class="nt"&gt;--host&lt;/span&gt; gitlab.com&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gitlab-ci-components-catalog"&lt;/span&gt;
&lt;span class="nv"&gt;DESC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Reusable GitLab CI/CD components (greeting, semver-guard) published to the CI/CD Catalog. Hands-on samples."&lt;/span&gt;

&lt;span class="c"&gt;# Create via API (or: glab repo create "SEON.N/${REPO_NAME}" --public --description "${DESC}")&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;namespace_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&amp;lt;your-namespace-id&amp;gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;visibility&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;public&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;description&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DESC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://gitlab.com/api/v4/projects"&lt;/span&gt;

git clone &lt;span class="s2"&gt;"https://oauth2:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@gitlab.com/SEON.N/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.git"&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPO_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
git symbolic-ref HEAD refs/heads/main
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; templates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;created: SEON.N/gitlab-ci-components-catalog | id 83321559 | vis public
Cloning into '/tmp/gitlab-ci-components-catalog'...
warning: You appear to have cloned an empty repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: A service account or CI environment may not have an SSH key, so push over HTTPS with a URL of the form &lt;code&gt;https://oauth2:${GITLAB_TOKEN}@...&lt;/code&gt;. The token needs the &lt;code&gt;api&lt;/code&gt; scope (to create the project) and the &lt;code&gt;write_repository&lt;/code&gt; scope (to push).&lt;/p&gt;
&lt;/blockquote&gt;
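&lt;p&gt;One fragile spot in the &lt;code&gt;curl&lt;/code&gt; call of this step is that the JSON body is assembled by shell string interpolation, so a description containing a double quote would produce invalid JSON. As an optional alternative, the body can be generated with python3, which is already among the tools used here. This is a sketch: &lt;code&gt;build_project_payload&lt;/code&gt; is a made-up helper, the description text is a variation chosen to include quotes, and &lt;code&gt;12345&lt;/code&gt; is a placeholder for your own namespace ID.&lt;/p&gt;

```python
import json


def build_project_payload(name, description, namespace_id, visibility="public"):
    """Return the JSON request body for creating a project, as a string."""
    return json.dumps({
        "name": name,
        "path": name,
        "namespace_id": namespace_id,
        "visibility": visibility,
        "description": description,
    })


# 12345 is a placeholder; substitute your own namespace ID.
body = build_project_payload(
    "gitlab-ci-components-catalog",
    'Reusable GitLab CI/CD components ("greeting", "semver-guard").',
    12345,
)
print(body)
```

&lt;p&gt;Because &lt;code&gt;json.dumps&lt;/code&gt; escapes quotes and other special characters, the printed string can be handed to &lt;code&gt;curl --data&lt;/code&gt; without manual escaping.&lt;/p&gt;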

&lt;h3&gt;
  
  
  Step 2 — The &lt;code&gt;greeting&lt;/code&gt; component (4 inputs + interpolation)
&lt;/h3&gt;

&lt;p&gt;This step writes the first component. To show the two essentials — typed inputs and interpolation — at once, we build &lt;code&gt;greeting&lt;/code&gt;, which takes four inputs and injects them into the job name and the script. It's the smallest example of a component whose behavior changes with its inputs.&lt;/p&gt;

&lt;p&gt;The file is &lt;code&gt;templates/greeting.yml&lt;/code&gt;. It declares its inputs with &lt;code&gt;string&lt;/code&gt;/&lt;code&gt;boolean&lt;/code&gt;/&lt;code&gt;options&lt;/code&gt;/&lt;code&gt;default&lt;/code&gt;, and interpolates them into the job name, the stage, and the job variables that the script reads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;greet.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Required:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;default."&lt;/span&gt;
    &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plain&lt;/span&gt;
      &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;plain&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;banner&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;shout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boolean&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;greeting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$[[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inputs.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]]"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# interpolated into the job name too&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$[[ inputs.stage ]]&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alpine:3.20&lt;/span&gt;
  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GREET_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inputs.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]]"&lt;/span&gt;
    &lt;span class="na"&gt;GREET_STYLE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inputs.style&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]]"&lt;/span&gt;
    &lt;span class="na"&gt;GREET_SHOUT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inputs.shout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]]"&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;MSG="Hello, ${GREET_NAME}!"&lt;/span&gt;
      &lt;span class="s"&gt;if [ "${GREET_SHOUT}" = "true" ]; then&lt;/span&gt;
        &lt;span class="s"&gt;MSG="$(echo "${MSG}" | tr '[:lower:]' '[:upper:]')"&lt;/span&gt;
      &lt;span class="s"&gt;fi&lt;/span&gt;
      &lt;span class="s"&gt;if [ "${GREET_STYLE}" = "banner" ]; then&lt;/span&gt;
        &lt;span class="s"&gt;LINE="$(echo "${MSG}" | sed 's/./=/g')"&lt;/span&gt;
        &lt;span class="s"&gt;printf '%s\n%s\n%s\n' "${LINE}" "${MSG}" "${LINE}"&lt;/span&gt;
      &lt;span class="s"&gt;else&lt;/span&gt;
        &lt;span class="s"&gt;echo "${MSG}"&lt;/span&gt;
      &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File: &lt;a href="https://gitlab.com/SEON.N/gitlab-ci-components-catalog/-/blob/main/templates/greeting.yml" rel="noopener noreferrer"&gt;templates/greeting.yml&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Write multi-line scripts as a &lt;code&gt;- |&lt;/code&gt; block scalar. If you write &lt;code&gt;- echo "OK: ..."&lt;/code&gt;, the &lt;code&gt;:&lt;/code&gt; (colon+space) inside a YAML plain scalar is parsed as a mapping separator and causes a parse error (see Troubleshooting below).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 3 — The &lt;code&gt;semver-guard&lt;/code&gt; component (regex) + &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This step builds a second component to show input validation (regex), and at the same time wires up &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; so the project self-tests its own components and releases from a tag. You assemble the component and the "publishing pipeline skeleton" together.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;templates/semver-guard.yml&lt;/code&gt; validates its input with a regex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;^v?\d+\.\d+\.\d+$&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;semver-guard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$[[ inputs.stage ]]&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alpine:3.20&lt;/span&gt;
  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;INPUT_VERSION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$[[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inputs.version&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]]"&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;set -e&lt;/span&gt;
      &lt;span class="s"&gt;echo "Validating version '${INPUT_VERSION}'"&lt;/span&gt;
      &lt;span class="s"&gt;echo "${INPUT_VERSION}" | grep -Eq '^v?[0-9]+\.[0-9]+\.[0-9]+$'&lt;/span&gt;
      &lt;span class="s"&gt;echo "OK - '${INPUT_VERSION}' is a valid semantic version"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project's own &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; (a) self-tests the components at the current SHA, and (b) creates a release from a tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/$CI_PROJECT_PATH/greeting@$CI_COMMIT_SHA&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitLab&lt;/span&gt;
      &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banner&lt;/span&gt;
      &lt;span class="na"&gt;shout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/$CI_PROJECT_PATH/semver-guard@$CI_COMMIT_SHA&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1.0.0&lt;/span&gt;

&lt;span class="na"&gt;create-release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;release&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.gitlab.com/gitlab-org/cli:latest&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_TAG&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Creating release for tag $CI_COMMIT_TAG"&lt;/span&gt;
  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tag_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_TAG&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Release&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$CI_COMMIT_TAG&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File: &lt;a href="https://gitlab.com/SEON.N/gitlab-ci-components-catalog/-/blob/main/.gitlab-ci.yml" rel="noopener noreferrer"&gt;.gitlab-ci.yml&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Using &lt;code&gt;@$CI_COMMIT_SHA&lt;/code&gt; in the self-test pulls in exactly the components in the commit you just pushed, at that point in time — the "test yourself in your own pipeline" pattern. Before pushing, we validated the YAML syntax locally.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;PY&lt;/span&gt;&lt;span class="sh"&gt;'
import yaml
for f in ["templates/greeting.yml","templates/semver-guard.yml",".gitlab-ci.yml"]:
    print("OK", f, len(list(yaml.safe_load_all(open(f)))), "doc")
&lt;/span&gt;&lt;span class="no"&gt;PY
&lt;/span&gt;&lt;span class="c"&gt;# OK templates/greeting.yml 2 doc&lt;/span&gt;
&lt;span class="c"&gt;# OK templates/semver-guard.yml 2 doc&lt;/span&gt;
&lt;span class="c"&gt;# OK .gitlab-ci.yml 1 doc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4 — Push and run the self-test pipeline (real output)
&lt;/h3&gt;

&lt;p&gt;This step checks that the component you wrote actually runs. Pushing to &lt;code&gt;main&lt;/code&gt; triggers the self-test pipeline, which runs each component exactly as it exists in that commit. It is the gate that filters out broken components before they are published.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add greeting and semver-guard reusable CI/CD components"&lt;/span&gt;
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin main
&lt;span class="c"&gt;# pushing to main triggers the self-test pipeline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the pipeline and job status via the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;83321559
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://gitlab.com/api/v4/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/pipelines?per_page=1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (job list)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14845634397 | semver-guard     | test | success | 8.4 s
14845634396 | greeting GitLab  | test | success | 10.3 s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
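
&lt;p&gt;A job list like the one above can be assembled from the response of &lt;code&gt;GET /projects/:id/pipelines/:pipeline_id/jobs&lt;/code&gt;. The following is a minimal sketch, not the script used for this article: &lt;code&gt;format_jobs&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;sample&lt;/code&gt; is hand-written data that reuses the IDs and durations shown above and assumes the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;stage&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;duration&lt;/code&gt; fields of a job object.&lt;/p&gt;

```python
def format_jobs(jobs):
    """Render job objects as 'id | name | stage | status | duration' rows."""
    # Pad names to the longest one so the columns line up.
    width = max((len(job["name"]) for job in jobs), default=0)
    rows = []
    for job in jobs:
        duration = job.get("duration")
        # A job that has not run yet has no duration.
        shown = "n/a" if duration is None else f"{duration:.1f} s"
        name = job["name"].ljust(width)
        rows.append(f"{job['id']} | {name} | {job['stage']} | {job['status']} | {shown}")
    return rows


# Hypothetical sample, shaped like a Jobs API response for one pipeline.
sample = [
    {"id": 14845634397, "name": "semver-guard", "stage": "test",
     "status": "success", "duration": 8.4},
    {"id": 14845634396, "name": "greeting GitLab", "stage": "test",
     "status": "success", "duration": 10.3},
]

for row in format_jobs(sample):
    print(row)
```

&lt;p&gt;In practice the list would come from decoding the JSON body of the API call rather than from a literal.&lt;/p&gt;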



&lt;p&gt;Notice the job is named &lt;code&gt;greeting GitLab&lt;/code&gt; — the interpolation of &lt;code&gt;"greeting $[[ inputs.name ]]"&lt;/code&gt; worked. The job logs (trace) look like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real output (greeting job trace)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ MSG="Hello, ${GREET_NAME}!" # collapsed multi-line command
==============
HELLO, GITLAB!
==============
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (semver-guard job trace)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ set -e # collapsed multi-line command
Validating version 'v1.0.0'
OK - 'v1.0.0' is a valid semantic version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: Both &lt;code&gt;style: banner&lt;/code&gt; and &lt;code&gt;shout: true&lt;/code&gt; took effect, so &lt;code&gt;HELLO, GITLAB!&lt;/code&gt; printed in uppercase between two rules of &lt;code&gt;=&lt;/code&gt; characters. It also shows that the &lt;code&gt;boolean&lt;/code&gt; input was interpolated into the &lt;code&gt;GREET_SHOUT&lt;/code&gt; variable as the string &lt;code&gt;"true"&lt;/code&gt;, which is what the script's &lt;code&gt;= "true"&lt;/code&gt; comparison matches.&lt;/p&gt;
&lt;/blockquote&gt;
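
&lt;p&gt;To reason about the component's output without a runner, the script's logic can be mirrored locally. This is only a sketch of the same steps in Python: &lt;code&gt;render_greeting&lt;/code&gt; is a hypothetical name, and &lt;code&gt;shout&lt;/code&gt; is deliberately a string because that is what the job script receives after interpolation.&lt;/p&gt;

```python
def render_greeting(name, style="plain", shout="false"):
    """Mirror the greeting job script: optional uppercasing, optional banner."""
    msg = f"Hello, {name}!"
    # The boolean input reaches the script as the text "true" or "false".
    if shout == "true":
        msg = msg.upper()
    if style == "banner":
        # One '=' per character of the message, like sed 's/./=/g'.
        line = "=" * len(msg)
        return "\n".join([line, msg, line])
    return msg


print(render_greeting("GitLab", style="banner", shout="true"))
```

&lt;p&gt;Called with the self-test inputs, it yields the same three lines as the job trace above.&lt;/p&gt;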

&lt;h3&gt;
  
  
  Step 5 — Register the catalog resource + publish the version (real output)
&lt;/h3&gt;

&lt;p&gt;This step puts the verified component in the catalog so it can be "consumed by version." First mark the project as a catalog resource, then push a semantic-version tag so the release job publishes that version to the catalog. From here, other projects can search for and use it.&lt;/p&gt;

&lt;p&gt;Mark the project as a "catalog resource." In the UI, this is the "CI/CD Catalog project" toggle under &lt;strong&gt;Settings &amp;gt; General &amp;gt; Visibility&lt;/strong&gt;. For automation, use the GraphQL &lt;code&gt;catalogResourcesCreate&lt;/code&gt; mutation (there is no REST endpoint yet; see issue 463043).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s2"&gt;"https://gitlab.com/api/graphql"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{"query":"mutation { catalogResourcesCreate(input: { projectPath: \"SEON.N/gitlab-ci-components-catalog\" }) { errors } }"}'&lt;/span&gt;
&lt;span class="c"&gt;# =&amp;gt; {"data":{"catalogResourcesCreate":{"errors":[]}}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now push a semantic-version tag. The tag pipeline runs the self-test jobs and then the &lt;code&gt;create-release&lt;/code&gt; job, which uses the &lt;code&gt;release&lt;/code&gt; keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git tag v1.0.0
git push origin v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (tag pipeline jobs)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14845637416 | create-release   | release | success
14845637415 | semver-guard     | test    | success
14845637414 | greeting GitLab  | test    | success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (create-release job trace)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "Creating release for tag $CI_COMMIT_TAG"
Creating release for tag v1.0.0
• Creating or updating release  repo=SEON.N/gitlab-ci-components-catalog tag=v1.0.0
✓ Release created:    url=https://gitlab.com/SEON.N/gitlab-ci-components-catalog/-/releases/v1.0.0
✓ Release succeeded after 0.77 seconds.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aot0nyo1736zk5oidtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aot0nyo1736zk5oidtb.png" alt="create-release job log — release created with the registry.gitlab.com/gitlab-org/cli (glab) image" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check that the version was registered in the catalog via GraphQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s2"&gt;"https://gitlab.com/api/graphql"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{"query":"{ ciCatalogResource(fullPath: \"SEON.N/gitlab-ci-components-catalog\") { name webPath versions { count nodes { name } } } }"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"ciCatalogResource"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gitlab-ci-components-catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"webPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/SEON.N/gitlab-ci-components-catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"versions"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v1.0.0"&lt;/span&gt;&lt;span class="p"&gt;}]}}}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbddn4e5n3ax90ngvwf1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbddn4e5n3ax90ngvwf1m.png" alt="v1.0.0 published in the CI/CD Catalog (Components: greeting, semver-guard)" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Per the current official docs, the &lt;code&gt;release&lt;/code&gt; job runs in the &lt;code&gt;registry.gitlab.com/gitlab-org/cli&lt;/code&gt; (glab) image, which replaces the former &lt;code&gt;release-cli&lt;/code&gt; image. Pushing a tag alone is not enough: without a job that uses the &lt;code&gt;release&lt;/code&gt; keyword, no version is created in the catalog.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 6 — Verify consumer include (CI Lint API, real output)
&lt;/h3&gt;

&lt;p&gt;This step checks whether another project can actually consume the published version. Without spinning up a runner, the CI Lint API alone confirms that version references and input validation behave as intended.&lt;/p&gt;

&lt;p&gt;The request below posts a candidate consumer configuration to the CI Lint API (&lt;code&gt;POST /projects/:id/ci/lint&lt;/code&gt;). With &lt;code&gt;include_jobs&lt;/code&gt; set, the response reports whether the &lt;code&gt;include&lt;/code&gt; resolves, whether the inputs validate, and which jobs would be created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;83321559
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITLAB_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{"include_jobs":true,"content":"stages: [test]\ninclude:\n  - component: gitlab.com/SEON.N/gitlab-ci-components-catalog/greeting@v1.0.0\n    inputs:\n      name: World\n      style: plain"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://gitlab.com/api/v4/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/ci/lint"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
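
&lt;p&gt;To try several references in one go, the request body can be generated per reference and each response reduced to one line. The sketch below is an assumption about how such a loop could look, not the script used here: &lt;code&gt;lint_payload&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; are hypothetical helpers, and the response is assumed to carry the &lt;code&gt;valid&lt;/code&gt;, &lt;code&gt;errors&lt;/code&gt; and &lt;code&gt;jobs&lt;/code&gt; fields. Sending the request is left out so the example runs offline.&lt;/p&gt;

```python
import json

COMPONENT = "gitlab.com/SEON.N/gitlab-ci-components-catalog/greeting"


def lint_payload(ref):
    """Build the JSON body for POST /projects/:id/ci/lint for one version ref."""
    content = "\n".join([
        "stages: [test]",
        "include:",
        f"  - component: {COMPONENT}@{ref}",
        "    inputs:",
        "      name: World",
        "      style: plain",
    ])
    return json.dumps({"include_jobs": True, "content": content})


def summarize(ref, response):
    """Reduce a decoded lint response to a one-line verdict."""
    # 'jobs' may be missing when the configuration is invalid.
    jobs = [job["name"] for job in response.get("jobs", [])]
    return f"@{ref}: valid={response['valid']} jobs={jobs} errors={response['errors']}"


for ref in ["v1.0.0", "~latest", "1.0.0"]:
    print(lint_payload(ref))
```

&lt;p&gt;Each printed body can be passed to &lt;code&gt;curl --data&lt;/code&gt; as in the command above, and the decoded reply handed to &lt;code&gt;summarize&lt;/code&gt;.&lt;/p&gt;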



&lt;p&gt;&lt;strong&gt;Output (three version references)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ref @v1.0.0     -&amp;gt; valid=True  jobs=['greeting World']  errors=[]
ref @~latest    -&amp;gt; valid=True  jobs=['greeting World']  errors=[]
ref @1.0.0      -&amp;gt; valid=False errors=["Component '.../greeting@1.0.0' - content not found"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the tag is &lt;code&gt;v1.0.0&lt;/code&gt;, an exact reference must match the tag name: &lt;code&gt;@v1.0.0&lt;/code&gt; resolves (as does &lt;code&gt;@~latest&lt;/code&gt;), while &lt;code&gt;@1.0.0&lt;/code&gt;, without the &lt;code&gt;v&lt;/code&gt; prefix, differs from the tag name and returns "content not found." Next, validate &lt;code&gt;semver-guard&lt;/code&gt;'s regex input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real output (regex input pass/reject)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== valid version v2.3.4 (regex pass) ===
valid: True | jobs: ['semver-guard'] | errors: []

=== invalid version 'not-a-version' (regex reject) ===
valid: False | errors: ['`.../semver-guard@v1.0.0`: `version` input: provided value does not match required RegEx pattern']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpypd4thkh4n9nj6aijwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpypd4thkh4n9nj6aijwm.png" alt="semver-guard in the catalog UI — version is a regex-validated input" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: An invalid version string is blocked by regex validation at the pipeline-creation stage, before the job ever runs. This is the core value of a component's typed/regex inputs.&lt;/p&gt;
&lt;/blockquote&gt;
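
&lt;p&gt;Candidate values can be pre-checked locally against the same pattern before they reach a pipeline. This is a minimal sketch: &lt;code&gt;is_valid_version&lt;/code&gt; is a hypothetical helper, and Python's &lt;code&gt;re&lt;/code&gt; module stands in for GitLab's own regex engine, so treat it as a local approximation only. &lt;code&gt;fullmatch&lt;/code&gt; plays the role of the &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;$&lt;/code&gt; anchors in the spec.&lt;/p&gt;

```python
import re

# Same shape as the spec's pattern; re.ASCII keeps \d limited to 0-9.
SEMVER = re.compile(r"v?\d+\.\d+\.\d+", re.ASCII)


def is_valid_version(value):
    """Return True when the whole value looks like 1.2.3 or v1.2.3."""
    return SEMVER.fullmatch(value) is not None


for candidate in ["v2.3.4", "not-a-version"]:
    print(candidate, is_valid_version(candidate))
```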

&lt;h2&gt;
  
  
  Advanced hands-on
&lt;/h2&gt;

&lt;p&gt;The basic loop — build the components (Steps 1–3), verify with self-test (Step 4), publish to the catalog (Step 5), confirm a consumer can pull them in (Step 6) — ends here. Below are advanced patterns common in practice: real consumption from another project, controlling breaking changes with version ranges, and composing multiple components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7 — Actually consume from a consumer project (usage)
&lt;/h3&gt;

&lt;p&gt;The point of the catalog is consuming a component with &lt;code&gt;include&lt;/code&gt; from &lt;strong&gt;another project&lt;/strong&gt;, not the one that built it. So we created a separate &lt;a href="https://gitlab.com/SEON.N/ci-components-consumer-demo" rel="noopener noreferrer"&gt;consumer project&lt;/a&gt; (public) and actually pulled it in.&lt;/p&gt;

&lt;p&gt;The consumer project's &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; only needs to pull the components in by version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/&amp;lt;your-namespace&amp;gt;/gitlab-ci-components-catalog/greeting@v1.0.0&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Consumer&lt;/span&gt;
      &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;banner&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/&amp;lt;your-namespace&amp;gt;/gitlab-ci-components-catalog/semver-guard@v1.0.0&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.1.4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (consumer pipeline jobs)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;greeting Consumer | test | success
semver-guard      | test | success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2f1x4ufpzqsmefu0h58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2f1x4ufpzqsmefu0h58.png" alt="Consumer project pipeline — running greeting and semver-guard included from the catalog" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The job named &lt;code&gt;greeting Consumer&lt;/code&gt; confirms the interpolation worked with the consumer's input (&lt;code&gt;name: Consumer&lt;/code&gt;). In other words, once a component is published, any project can pull it in with its own inputs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The catalog aggregates a per-component usage count. Query it via GraphQL.&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;glab api graphql &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nv"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{ ciCatalogResource(fullPath:"&amp;lt;group&amp;gt;/&amp;lt;project&amp;gt;"){ versions{ nodes{ name components{ nodes{ name last30DayUsageCount } } } } } }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;However, &lt;code&gt;last30DayUsageCount&lt;/code&gt; is not reflected immediately after consumption (in our test it was still 0 on a same-day re-query). Check usage after the aggregation has refreshed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 8 — v2.0.0 release and version ranges (controlling breaking changes)
&lt;/h3&gt;

&lt;p&gt;Signal a breaking change with a MAJOR bump. We changed &lt;code&gt;greeting&lt;/code&gt;'s default &lt;code&gt;style&lt;/code&gt; from &lt;code&gt;plain&lt;/code&gt; to &lt;code&gt;banner&lt;/code&gt; (a change that alters existing consumers' default output) and released &lt;code&gt;v2.0.0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# templates/greeting.yml: change style default from plain to banner (breaking)&lt;/span&gt;
git commit &lt;span class="nt"&gt;-am&lt;/span&gt; &lt;span class="s2"&gt;"greeting v2.0.0: change default style to banner (breaking change)"&lt;/span&gt;
git tag v2.0.0
git push origin main v2.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the tag pipeline passes self-test and then release, two versions coexist in the catalog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real output (catalog versions)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"versions"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v2.0.0"&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v1.0.0"&lt;/span&gt;&lt;span class="p"&gt;}]}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the version a consumer pulls in depends on which reference it uses. We verified with CI Lint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real output (version reference resolution)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@v1.0.0  -&amp;gt; valid     exact tag
@v2.0.0  -&amp;gt; valid     exact tag
@1       -&amp;gt; valid     latest in the 1.x range = v1.0.0
@2       -&amp;gt; valid     latest in the 2.x range = v2.0.0
@~latest -&amp;gt; valid     latest released (pre-releases excluded) = v2.0.0
@1.0.0   -&amp;gt; invalid   content not found (the tag is v1.0.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What matters is the operational strategy. Even when the breaking &lt;code&gt;v2.0.0&lt;/code&gt; ships, a consumer pinned to &lt;code&gt;@1&lt;/code&gt; keeps receiving 1.x and stays safe. A consumer on &lt;code&gt;@~latest&lt;/code&gt;, however, automatically gets v2.0.0 and its behavior changes (here, the default output becomes banner). So production pipelines should use a partial version like &lt;code&gt;@1&lt;/code&gt; or a pinned version, and avoid &lt;code&gt;@~latest&lt;/code&gt;.&lt;/p&gt;
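
&lt;p&gt;The resolution results above can be modeled in a few lines to see why &lt;code&gt;@1&lt;/code&gt; is insulated from &lt;code&gt;v2.0.0&lt;/code&gt;. This is a toy model written to reproduce the six observed results, not GitLab's resolver: &lt;code&gt;parse&lt;/code&gt; and &lt;code&gt;resolve&lt;/code&gt; are hypothetical names, and the rule that a full version without the &lt;code&gt;v&lt;/code&gt; prefix does not match is taken from the CI Lint output, not from the docs.&lt;/p&gt;

```python
import re

TAG = re.compile(r"v?(\d+)\.(\d+)\.(\d+)(-.+)?")


def parse(tag):
    """Split a tag into ((major, minor, patch), is_prerelease), or None."""
    m = TAG.fullmatch(tag)
    if m is None:
        return None
    major, minor, patch, pre = m.groups()
    return (int(major), int(minor), int(patch)), pre is not None


def resolve(ref, tags):
    """Pick the tag a reference would select, or None when nothing matches."""
    if ref in tags:  # exact tag name
        return ref
    candidates = []
    for tag in tags:
        parsed = parse(tag)
        if parsed is None or parsed[1]:  # skip non-semver and pre-releases
            continue
        candidates.append((parsed[0], tag))
    if ref != "~latest":
        # Only partial versions such as "1" or "1.2" are ranges in this model.
        if not re.fullmatch(r"\d+(\.\d+)?", ref):
            return None
        wanted = tuple(int(part) for part in ref.split("."))
        candidates = [c for c in candidates if c[0][:len(wanted)] == wanted]
    return max(candidates)[1] if candidates else None


tags = ["v1.0.0", "v2.0.0"]
for ref in ["v1.0.0", "v2.0.0", "1", "2", "~latest", "1.0.0"]:
    print(ref, resolve(ref, tags))
```

&lt;p&gt;Adding a later &lt;code&gt;v1.1.0&lt;/code&gt; to &lt;code&gt;tags&lt;/code&gt; moves &lt;code&gt;resolve("1", tags)&lt;/code&gt; forward while leaving it below 2.x, which is the behavior the strategy above relies on.&lt;/p&gt;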

&lt;h3&gt;
  
  
  Step 9 — Composing components (multiple components in one pipeline)
&lt;/h3&gt;

&lt;p&gt;The real power of components is "assembly": bundling several single-purpose components as stages into a standard pipeline. Here we add one more component, &lt;code&gt;lint&lt;/code&gt;, and compose the self-test as a &lt;code&gt;lint → test&lt;/code&gt; flow.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;templates/lint.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lint&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;component-lint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$[[ inputs.stage ]]&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alpine:3.20&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;echo "Linting component templates..."&lt;/span&gt;
      &lt;span class="s"&gt;for f in templates/*.yml; do echo "checking ${f}"; done&lt;/span&gt;
      &lt;span class="s"&gt;echo "OK - lint passed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



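&lt;p&gt;The &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; sets &lt;code&gt;stages&lt;/code&gt; to &lt;code&gt;[lint, test, release]&lt;/code&gt; and includes the &lt;code&gt;lint&lt;/code&gt; component in the &lt;code&gt;lint&lt;/code&gt; stage, ahead of the existing &lt;code&gt;greeting&lt;/code&gt;/&lt;code&gt;semver-guard&lt;/code&gt; jobs in the &lt;code&gt;test&lt;/code&gt; stage.&lt;br&gt;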
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;lint&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_SERVER_FQDN/$CI_PROJECT_PATH/lint@$CI_COMMIT_SHA&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lint&lt;/span&gt;
  &lt;span class="c1"&gt;# greeting and semver-guard are the same as Step 3 (test stage)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real output (pipeline graph)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2nujmbmnnd2j18vdz5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2nujmbmnnd2j18vdz5l.png" alt="self-test pipeline composed as lint → test in two stages" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lint&lt;/code&gt; stage's &lt;code&gt;component-lint&lt;/code&gt; must pass first, then the &lt;code&gt;test&lt;/code&gt; stage's &lt;code&gt;greeting&lt;/code&gt;/&lt;code&gt;semver-guard&lt;/code&gt; run. Each component is versioned and published independently, but the consumer composes them as stages into a single standard pipeline — this is the heart of an "organization-standard pipeline catalog."&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Command/method&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Component works&lt;/td&gt;
&lt;td&gt;self-test pipeline&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;greeting&lt;/code&gt;/&lt;code&gt;semver-guard&lt;/code&gt; jobs succeed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interpolation&lt;/td&gt;
&lt;td&gt;job name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;greeting GitLab&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalog publish&lt;/td&gt;
&lt;td&gt;GraphQL &lt;code&gt;ciCatalogResource&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;versions.count == 1&lt;/code&gt;, &lt;code&gt;v1.0.0&lt;/code&gt; (the count becomes 2 once &lt;code&gt;v2.0.0&lt;/code&gt; is published)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Release&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /projects/:id/releases&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tag_name: v1.0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer resolution&lt;/td&gt;
&lt;td&gt;CI Lint &lt;code&gt;@v1.0.0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;valid=true&lt;/code&gt;, job merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input validation&lt;/td&gt;
&lt;td&gt;CI Lint, invalid value&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;valid=false&lt;/code&gt;, RegEx error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On failure: if the pipeline doesn't start, check the project's runner setup (shared/group runner) and the &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; syntax (&lt;code&gt;glab ci lint&lt;/code&gt;); if the version doesn't appear in the catalog, check (1) catalog-resource registration, (2) use of the &lt;code&gt;release&lt;/code&gt; keyword, (3) presence of a description/README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Permission model &amp;amp; token rotation&lt;/strong&gt;: Publishing requires the Owner role and a token with the &lt;code&gt;api&lt;/code&gt; and &lt;code&gt;write_repository&lt;/code&gt; scopes. In CI, prefer &lt;code&gt;CI_JOB_TOKEN&lt;/code&gt; or a group access token where possible. Give every personal access token (PAT) an expiry and rotate it regularly; a token that never expires is especially risky.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt;: Gather component projects at the group level to enforce a standard pipeline. Signal breaking changes with semantic versions (MAJOR.MINOR.PATCH); consumers pin &lt;code&gt;@1&lt;/code&gt; (a partial version) to auto-accept only non-breaking updates, or pin an exact version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;~latest&lt;/code&gt; caution&lt;/strong&gt;: Convenient, but breaking changes flow in automatically. Production pipelines should prefer a pinned version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security hardening&lt;/strong&gt;: Put &lt;code&gt;regex&lt;/code&gt;/&lt;code&gt;options&lt;/code&gt; on the inputs a component takes to shrink the arbitrary-command-injection surface. When taking an image tag as input, restrict it with a regex too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring &amp;amp; observability&lt;/strong&gt;: Wire release/pipeline failures to alerts — e.g., a webhook to a notification channel on pipeline failure. (Integrate with your environment's observability stack, e.g. Prometheus/Loki/Tempo.) Expose the pipeline status badge and the catalog version count as dashboard metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure-recovery runbook&lt;/strong&gt;: If you published a bad version: (1) bump a patch version and republish, (2) consumers pin to the last good version, (3) if a &lt;code&gt;~latest&lt;/code&gt; consumer broke, switch it to a pinned version immediately.&lt;/li&gt;
&lt;/ul&gt;
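
&lt;p&gt;To illustrate the security-hardening point, here is a sketch of the &lt;code&gt;lint&lt;/code&gt; component taking its image tag as an input and restricting it with a regex. The &lt;code&gt;image_tag&lt;/code&gt; input and its pattern are hypothetical additions and are not part of the component built in this hands-on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# templates/lint.yml (sketch; the image_tag input is a hypothetical addition)
spec:
  inputs:
    stage:
      type: string
      default: lint
    image_tag:
      type: string
      # Quoted so YAML keeps it a string instead of reading 3.20 as a number
      default: "3.20"
      # Accept only MAJOR.MINOR tags such as 3.20; anything else is rejected
      regex: ^[0-9]+\.[0-9]+$
---
component-lint:
  stage: $[[ inputs.stage ]]
  image: alpine:$[[ inputs.image_tag ]]
  script:
    - echo "lint with a validated image tag"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the pattern is anchored at both ends, a value that tries to carry extra characters after the tag should fail input validation before any job is created.&lt;/p&gt;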

&lt;h2&gt;
  
  
  Common mistakes &amp;amp; troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;content not found&lt;/code&gt; on &lt;code&gt;include&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;tag is &lt;code&gt;v1.0.0&lt;/code&gt; but referenced as &lt;code&gt;@1.0.0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;match the include version to the actual tag name (&lt;code&gt;@v1.0.0&lt;/code&gt;), or use &lt;code&gt;@~latest&lt;/code&gt;/&lt;code&gt;@1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML parse error &lt;code&gt;expected &amp;lt;block end&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;a &lt;code&gt;:&lt;/code&gt; (colon+space) in a plain-scalar script (&lt;code&gt;echo "OK: ..."&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;write multi-line scripts as a block scalar so colon+space isn't in a plain scalar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;version doesn't appear in the catalog&lt;/td&gt;
&lt;td&gt;project isn't registered as a catalog resource&lt;/td&gt;
&lt;td&gt;enable via &lt;code&gt;catalogResourcesCreate&lt;/code&gt; GraphQL or the UI toggle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;registered but zero versions&lt;/td&gt;
&lt;td&gt;pushed only a tag without the &lt;code&gt;release&lt;/code&gt; keyword&lt;/td&gt;
&lt;td&gt;add a &lt;code&gt;release:&lt;/code&gt; job to the pipeline and re-push the tag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pipeline doesn't start at all&lt;/td&gt;
&lt;td&gt;no active runner on the project (gitlab.com free tier may require account verification)&lt;/td&gt;
&lt;td&gt;enable a shared/group runner; check the group's &lt;code&gt;shared_runners_setting&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API or git calls return 401&lt;/td&gt;
&lt;td&gt;the token belongs to a different host (for example a self-managed instance) or has expired&lt;/td&gt;
&lt;td&gt;use the correct host's token via &lt;code&gt;glab config get token --host gitlab.com&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
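
&lt;p&gt;The colon+space row is easiest to see side by side. The fragment below is a sketch, not the exact job from this hands-on: it shows the form that trips the parser as a comment, followed by two ways of writing the same command safely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;component-lint:
  script:
    # Problem: in a plain scalar, colon+space is read as a key/value separator
    # - echo "OK: lint passed"

    # Fix 1: quote the whole command so it is a single string
    - 'echo "OK: lint passed"'

    # Fix 2: use a block scalar, as templates/lint.yml does in Step 9
    - |
      echo "OK: lint passed"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;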

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;array&lt;/code&gt; inputs and indexing&lt;/strong&gt;: use structured inputs like &lt;code&gt;$[[ inputs.servers[0].host ]]&lt;/code&gt; (max 5 indices per segment).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengthen component tests&lt;/strong&gt;: add a negative test to the self-test that asserts invalid input is actually rejected, so a broken component never gets published.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search the CI/CD Catalog&lt;/strong&gt;: explore public components at &lt;code&gt;https://gitlab.com/explore/catalog&lt;/code&gt; to learn reuse patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;

&lt;p&gt;This hands-on uses only git push and release, and stands up no extra infrastructure. gitlab.com pipelines consume a small amount of the project's CI minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After the hands-on, clean up (optional):&lt;/span&gt;
&lt;span class="c"&gt;# - Delete the whole project only after your own approval. To keep it, just revoke the token.&lt;/span&gt;
&lt;span class="c"&gt;# - Manage the temporary token used for git push per your expiry/rotation policy.&lt;/span&gt;
&lt;span class="c"&gt;# - Clean the local working directory: rm -rf /tmp/gitlab-ci-components-catalog&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cost/billing note&lt;/strong&gt;: Publishing to the catalog itself costs nothing extra. But running pipelines consumes &lt;strong&gt;CI minutes&lt;/strong&gt;. Use &lt;code&gt;rules&lt;/code&gt; to avoid repeatedly triggering large self-tests.&lt;/p&gt;
&lt;/blockquote&gt;
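
&lt;p&gt;One possible way to apply that advice is a top-level &lt;code&gt;workflow:rules&lt;/code&gt; block in the component project's &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;. This is a sketch under the assumption that self-tests are only needed for tags, merge requests, and the default branch; adjust the conditions to your own branching model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .gitlab-ci.yml (sketch; which events deserve a pipeline is an assumption)
workflow:
  rules:
    # Tag pushes must still run so the release job can publish
    - if: $CI_COMMIT_TAG
    # Validate proposed changes once, in the merge request pipeline
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Run on the default branch after merge
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Anything else (for example pushes to feature branches) creates no pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;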

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD components | GitLab Docs — &lt;a href="https://docs.gitlab.com/ci/components/" rel="noopener noreferrer"&gt;https://docs.gitlab.com/ci/components/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CI/CD inputs | GitLab Docs — &lt;a href="https://docs.gitlab.com/ci/inputs/" rel="noopener noreferrer"&gt;https://docs.gitlab.com/ci/inputs/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CI/CD Catalog goes GA (GitLab Blog, 2024-05-08) — &lt;a href="https://about.gitlab.com/blog/ci-cd-catalog-goes-ga-no-more-building-pipelines-from-scratch/" rel="noopener noreferrer"&gt;https://about.gitlab.com/blog/ci-cd-catalog-goes-ga-no-more-building-pipelines-from-scratch/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Introducing the CI/CD Catalog beta (GitLab Blog, 2023-12-21) — &lt;a href="https://about.gitlab.com/blog/introducing-the-gitlab-ci-cd-catalog-beta/" rel="noopener noreferrer"&gt;https://about.gitlab.com/blog/introducing-the-gitlab-ci-cd-catalog-beta/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitLab 17.0 release (2024-05-16) — &lt;a href="https://about.gitlab.com/blog/gitlab-17-0-release" rel="noopener noreferrer"&gt;https://about.gitlab.com/blog/gitlab-17-0-release&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CI/CD Catalog (explore) — &lt;a href="https://gitlab.com/explore/catalog" rel="noopener noreferrer"&gt;https://gitlab.com/explore/catalog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GraphQL API reference — &lt;a href="https://docs.gitlab.com/api/graphql/reference/" rel="noopener noreferrer"&gt;https://docs.gitlab.com/api/graphql/reference/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gitlab</category>
      <category>cicd</category>
      <category>devops</category>
      <category>ci</category>
    </item>
    <item>
      <title>Hybrid k3s #5: Putting kubectl down — GitOps 1/3</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Sun, 14 Jun 2026 13:03:11 +0000</pubDate>
      <link>https://dev.to/seon/hybrid-k3s-5-putting-kubectl-down-gitops-13-eig</link>
      <guid>https://dev.to/seon/hybrid-k3s-5-putting-kubectl-down-gitops-13-eig</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-5%2Fk3s-5-master-en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-5%2Fk3s-5-master-en.png" alt="Hybrid k3s — full architecture" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  0. About this series
&lt;/h2&gt;

&lt;p&gt;This series is a record — written one piece at a time — of how I built the homelab in the image above, the one that's still running as I write this.&lt;/p&gt;

&lt;p&gt;What started as a toy project prompted by a simple "would this even work?" has, through satisfying performance and an endless cycle of tearing down and rebuilding, become a real toy that takes the edge off the stress that builds up at work. It isn't a resource-rich cluster, but it has been more than enough to get a real taste of Kubernetes, and it keeps handing me the next thing I want to try.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 nodes&lt;/strong&gt; — 2 Lightsail &lt;strong&gt;servers&lt;/strong&gt; (control plane + etcd) in the cloud (AWS Tokyo) + 4 &lt;strong&gt;Lima VM agents&lt;/strong&gt; on a home (Sapporo) iMac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 vCPU / 61 GiB&lt;/strong&gt; total, &lt;strong&gt;49 namespaces&lt;/strong&gt;, &lt;strong&gt;248 pods (150 running)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed with &lt;strong&gt;ArgoCD&lt;/strong&gt;, auth via &lt;strong&gt;Keycloak OIDC&lt;/strong&gt;, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana and more running on top&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Through part 4, I stood up the cluster and CloudNativePG on it &lt;strong&gt;imperatively (&lt;code&gt;helm&lt;/code&gt;, &lt;code&gt;kubectl apply&lt;/code&gt;)&lt;/strong&gt;. This part is about turning all of that &lt;strong&gt;into GitOps with ArgoCD&lt;/strong&gt;. The scope, from tool choice to cluster bootstrap to secret management, is too wide for one part, so I'll &lt;strong&gt;split it into three.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 5 (this one) · Design&lt;/strong&gt; — why GitOps, and what to use (tool and structure), decided by comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6 · Bootstrap&lt;/strong&gt; — install ArgoCD and stand up the cluster's skeleton with app-of-apps and ApplicationSet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 7 · Apply&lt;/strong&gt; — move CloudNativePG over to GitOps as the first target, and finish off secret (password) management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This first part takes us &lt;strong&gt;as far as deciding the tool and the structure.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Background — the things I'd stood up imperatively started to pile up
&lt;/h2&gt;

&lt;p&gt;When I brought up CloudNativePG in part 4, two kinds of commands were enough. I installed the operator with &lt;code&gt;helm&lt;/code&gt;, and brought up the database cluster with &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Part 4 — install the operator&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; cnpg cnpg/cloudnative-pg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; cnpg-system &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="nt"&gt;--wait&lt;/span&gt;

&lt;span class="c"&gt;# Part 4 — apply demo-db.yaml (Cluster CRD)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo-db.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes broadly distinguishes three ways of managing objects — imperative commands (&lt;code&gt;kubectl create ...&lt;/code&gt;), imperative object configuration, and &lt;strong&gt;declarative object configuration (&lt;code&gt;kubectl apply -f&lt;/code&gt;)&lt;/strong&gt; (&lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management/" rel="noopener noreferrer"&gt;Kubernetes — Object Management&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The problem was &lt;strong&gt;where that YAML lived, and who applied it, and when.&lt;/strong&gt; In my case the file was somewhere on my laptop, the person applying it was me, and the timing was "whenever it crossed my mind."&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;operating declarative manifests by hand&lt;/strong&gt;, and as the things I was bringing up grew one by one, I soon hit a wall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ad0u9d3okeptimoov6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ad0u9d3okeptimoov6s.png" alt="Even declarative manifests drift when run by hand" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1-1. The walls I hit operating by hand
&lt;/h3&gt;

&lt;p&gt;An app soon sits on top of the DB, an ingress in front of it, secrets and backups beside it. As I ran more and more services, the cluster grew to &lt;strong&gt;248 pods across 49 namespaces.&lt;/strong&gt; Running that scale by hand with &lt;code&gt;apply&lt;/code&gt;, I ran into the following walls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xyop2hwc53m865ui7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm6xyop2hwc53m865ui7.png" alt="Four things that break under hand-run ops" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift — the cluster becomes the only truth.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The moment you fix the cluster directly with &lt;code&gt;kubectl edit&lt;/code&gt; or &lt;code&gt;kubectl scale&lt;/code&gt; in a pinch, the original YAML and the actual state diverge. That change isn't recorded in any file, so re-&lt;code&gt;apply&lt;/code&gt;ing the original later silently overwrites it or conflicts.&lt;/li&gt;
&lt;li&gt;Kubernetes itself works by having controllers &lt;em&gt;continuously reconcile the current state toward the desired state&lt;/em&gt; (&lt;a href="https://kubernetes.io/docs/concepts/architecture/controller/" rel="noopener noreferrer"&gt;Kubernetes — Controllers&lt;/a&gt;), but if that "desired state" lives only in my head and in scattered files, there's no reference point for reconciliation at all.&lt;/li&gt;
&lt;li&gt;"What's running now is the real thing" — but that real thing isn't in code.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No history — you can't answer "why is it like this?"&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Why is &lt;code&gt;replicas&lt;/code&gt; 3, who added this env var, when, and why — manual ops keeps no record of it. You're left leaning on shell history and memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not reproducible — you can't rebuild the cluster.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Tear down a node or build a new cluster, and you have to re-type all those &lt;code&gt;apply&lt;/code&gt;s again, in the right order and with the right dependencies. Thinking back to parts 1 and 2, where I tore down and rebuilt the cluster over and over, this wasn't somebody else's problem.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No audit or collaboration — there's no point to stop and review.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;There's no review of a change, no approval, no revert. Hit Enter and it goes straight to production.&lt;/li&gt;
&lt;li&gt;Honestly, for a homelab I use alone, this item hurts the least. But the real point of running this homelab is &lt;strong&gt;to get hands-on with a way of working that transfers directly to a work environment — an enterprise cluster operated by many people.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The moment a team touches the same cluster, 'who / when / why' and review, approval, and rollback stop being optional and become &lt;strong&gt;essential&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A single change can lead straight to an outage, and without a record to trace it, neither recovery nor accountability is possible.&lt;/li&gt;
&lt;li&gt;GitOps structurally removes this problem by making every change &lt;strong&gt;go through Git.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GitLab explains that &lt;em&gt;a merge commit into the main (trunk) branch itself becomes the audit trail&lt;/em&gt;, and the Merge/Pull Request becomes the place where review, approval, and collaboration happen (&lt;a href="https://about.gitlab.com/topics/gitops/" rel="noopener noreferrer"&gt;GitLab — What is GitOps&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;This aligns with one of CNCF OpenGitOps' core principles, &lt;em&gt;"Versioned and Immutable"&lt;/em&gt; (a versioned, immutable history of state) (&lt;a href="https://opengitops.dev/" rel="noopener noreferrer"&gt;OpenGitOps Principles&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest cause of these walls wasn't whether manifests existed — it was whether those manifests were &lt;strong&gt;gathered in one place as the single source of truth, and applied continuously and automatically.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even declarative manifests bring drift, missing history, no reproducibility, and no audit when applied by hand. The point is to &lt;em&gt;gather declarations in one Git place and continuously automate how they're applied&lt;/em&gt; — and especially in an enterprise where many people touch the same cluster, audit and collaboration become essential.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. GitOps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. What GitOps is
&lt;/h3&gt;

&lt;p&gt;GitOps, in one line, is &lt;strong&gt;"an operating model where you declare the desired state in Git, and keep the cluster always matching that declaration."&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, declare the desired state of the system (infrastructure and apps) &lt;strong&gt;with Git as the single source of truth.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Second, a &lt;strong&gt;software agent&lt;/strong&gt; inside the cluster automatically pulls that declaration and converges the actual state to it.

&lt;ul&gt;
&lt;li&gt;Humans only write "this is how it should be" into Git; reflecting and keeping it in the cluster is the agent's job.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simply &lt;strong&gt;"putting YAML in Git" is not GitOps in itself.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The point is to nail Git down as the single source of truth, &lt;em&gt;so that the cluster cannot change without going through Git.&lt;/em&gt; If the path of typing &lt;code&gt;kubectl&lt;/code&gt; by hand stays open alongside it, Git is just a file store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpat3l9954vtwaimjr9oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpat3l9954vtwaimjr9oj.png" alt="GitOps concept — declare in Git, an agent converges automatically" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach was first named by Weaveworks' Alexis Richardson in a 2017 piece called &lt;em&gt;Operations by Pull Request&lt;/em&gt;, and today the &lt;strong&gt;CNCF's OpenGitOps&lt;/strong&gt; project standardizes and maintains its definition with four principles (&lt;a href="https://opengitops.dev/" rel="noopener noreferrer"&gt;OpenGitOps&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;① Declarative&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Write the desired state not as "do this" (a command) but as "it should be this way" (a declaration).&lt;/li&gt;
&lt;li&gt;Commands depend on order and timing; a declaration converges to the same result whenever it's applied.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;② Versioned and Immutable&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Keep that declaration somewhere, like Git, where versions remain and nothing can be changed arbitrarily.&lt;/li&gt;
&lt;li&gt;Every change is set in stone as a commit, leaving &lt;em&gt;who, when, and why&lt;/em&gt;, and a revert becomes a rollback.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;③ Pulled Automatically&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Rather than a human pushing it in, the agent &lt;em&gt;pulls the declaration itself.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;④ Continuously Reconciled&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The agent constantly observes the actual state and, when it diverges from the declaration, brings it back in line.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2-2. Why GitOps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift → corrected by ③ auto-pull + ④ continuous reconciliation.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The agent constantly compares Git against the actual state, so even if someone diverges it by hand with &lt;code&gt;kubectl edit&lt;/code&gt;, the next reconciliation reverts it (self-heal).&lt;/li&gt;
&lt;li&gt;"What's running now is the truth" gives way to "Git is the truth."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No history or audit → solved by ② versioning.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Every change remains as a commit or Pull Request, so &lt;em&gt;who, when, and why&lt;/em&gt; is traceable, and there are points for review, approval, and rollback.&lt;/li&gt;
&lt;li&gt;Especially important in an enterprise where several people touch the same cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not reproducible → solved by ① declaration + ② a single source.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The desired state of the entire cluster is declared in one place in Git, so even rebuilding the cluster restores the same shape by re-applying that declaration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, the value of GitOps isn't "because it's convenient" — it's that it &lt;strong&gt;structurally removes the problems that manual ops structurally carried.&lt;/strong&gt; And that value splits once more on safety, depending on &lt;em&gt;who&lt;/em&gt; performs ③ and ④ and &lt;em&gt;in which direction.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. Push delivery and Pull delivery
&lt;/h3&gt;

&lt;p&gt;There are two ways to deliver a change to the actual cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ttwgp50kems4zy77xh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ttwgp50kems4zy77xh3.png" alt="Push delivery vs Pull delivery" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Push delivery&lt;/strong&gt; is where a CI/CD pipeline outside the cluster pushes the change into the cluster. The pipeline builds and then applies the manifests, and gitops.tech points out the limitation that this approach "is only triggered when the environment repository changes, and (the cluster's) deviation isn't noticed on its own" (&lt;a href="https://www.gitops.tech/" rel="noopener noreferrer"&gt;gitops.tech&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull delivery&lt;/strong&gt; is where an agent (operator) &lt;em&gt;inside&lt;/em&gt; the cluster watches Git directly and pulls the change in. As gitops.tech describes it, the operator "continuously compares the desired state in the environment repository with the actual deployed state, and aligns the infrastructure if there's a difference." That this compare-and-correct never stops is the decisive difference from Push.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-4. Why Pull is safer and more robust
&lt;/h3&gt;

&lt;p&gt;There are two clear reasons Pull is held up as the recommended GitOps approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, security — credentials never leave the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push requires an external CI to hold privileged credentials to connect to the cluster.&lt;/li&gt;
&lt;li&gt;Pull, by contrast, has the deploying party inside the cluster, so "the external service doesn't need to know the credentials" (&lt;a href="https://www.gitops.tech/" rel="noopener noreferrer"&gt;gitops.tech&lt;/a&gt;), and the connection uses only outbound (egress) from the cluster.&lt;/li&gt;
&lt;li&gt;CNCF also, in a 2025 piece, summarizes the pull model's security benefit as "not exposing the cluster to external push traffic" (&lt;a href="https://www.cncf.io/blog/2025/06/09/gitops-in-2025-from-old-school-updates-to-the-modern-way/" rel="noopener noreferrer"&gt;CNCF, 2025&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Second, self-heal — it reverts deviation on its own.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Pull agent keeps comparing the actual state against Git, so even drift someone introduced directly with &lt;code&gt;kubectl edit&lt;/code&gt; is reverted at the next reconciliation.&lt;/li&gt;
&lt;li&gt;The drift that "a human had to revert" in section 1 is now reverted by the controller.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. What to do GitOps with — ArgoCD vs Flux vs Fleet
&lt;/h2&gt;

&lt;p&gt;In section 2 I explained what GitOps is (declare the desired state in Git → an agent converges to it automatically), why it addresses the four limits of manual ops, and why the Pull approach is safer and recommended.&lt;/p&gt;

&lt;p&gt;Now I need to research &lt;em&gt;the tool that will actually run that Pull.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are several Kubernetes GitOps tools, but the three I picked as serious candidates for real-world comparison are — &lt;strong&gt;ArgoCD, Flux CD, and Rancher Fleet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All three share the same essence — "make Git the single source and match the cluster to that declaration" — but their character clearly splits on &lt;em&gt;how they structure controllers, how they divide CRDs, whether they embed a UI, and how many clusters they have in mind.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me go through each one's concept and architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  3-1. ArgoCD — an app-centric GitOps controller
&lt;/h3&gt;

&lt;p&gt;ArgoCD is part of the &lt;strong&gt;Argo project&lt;/strong&gt; that Intuit built and donated, and the official docs define it as "a declarative GitOps continuous delivery tool for Kubernetes" (&lt;a href="https://argo-cd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Argo CD Docs&lt;/a&gt;). The &lt;code&gt;application-controller&lt;/code&gt; that handles reconcile (matching the current "actual" state to the "desired" state), the &lt;code&gt;repo-server&lt;/code&gt; that caches and renders Git, and the &lt;code&gt;server&lt;/code&gt; that provides the API and UI all work as one suite.&lt;/p&gt;

&lt;p&gt;Below is ArgoCD's detailed architecture, drawn against the latest stable version (v3.4.3, the same version my cluster runs). Around the three core components (API server, repository server, and application controller) sit ApplicationSet, Redis, and Dex; the repository server pulls and renders Git, and the application controller compares the result against the live state and syncs it to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpxc9o64yz3z9p5kisf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lpxc9o64yz3z9p5kisf.png" alt="ArgoCD architecture v3.4.3" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ArgoCD has two distinctive traits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Everything is grouped into "apps" centered on a CRD called &lt;code&gt;Application&lt;/code&gt;.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You declare "sync this Git path to this namespace of this cluster" in a single &lt;code&gt;Application&lt;/code&gt;, and scale up with the &lt;strong&gt;app-of-apps&lt;/strong&gt; pattern, where one app owns many, or with &lt;strong&gt;ApplicationSet&lt;/strong&gt;, which auto-generates apps from a template.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It ships with a rich Web UI built in.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The official docs cite "a Web UI that shows application activity in real time" as a core feature, and you handle sync status, diff, and rollback, plus SSO (OIDC) and RBAC, right from the screen (&lt;a href="https://argo-cd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Argo CD Docs&lt;/a&gt;). Its maturity is solid too. Argo entered CNCF incubation in 2020 and reached &lt;strong&gt;Graduated on December 6, 2022&lt;/strong&gt; (&lt;a href="https://www.cncf.io/announcements/2022/12/06/the-cloud-native-computing-foundation-announces-argo-has-graduated/" rel="noopener noreferrer"&gt;CNCF — Argo Graduated&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
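
&lt;p&gt;To make the first trait concrete, here is a minimal sketch of a single &lt;code&gt;Application&lt;/code&gt;. The app name, repository URL, and path are placeholders chosen for illustration, not taken from a real cluster; the field names follow the &lt;code&gt;Application&lt;/code&gt; CRD.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook                  # hypothetical app name
  namespace: argocd                # namespace where ArgoCD is installed
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/gitops-config.git   # placeholder repo
    targetRevision: main
    path: apps/guestbook           # "this Git path"
  destination:
    server: https://kubernetes.default.svc   # "this cluster" (the one ArgoCD runs in)
    namespace: guestbook           # "this namespace"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;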

&lt;h3&gt;
  
  
  3-2. Flux CD — a composable GitOps toolkit
&lt;/h3&gt;

&lt;p&gt;Flux was originally built by Weaveworks, and in v2 it was rewritten on top of the Kubernetes controller-runtime and its own &lt;strong&gt;GitOps Toolkit&lt;/strong&gt;. The official site introduces Flux as "a set of continuous and progressive delivery solutions for Kubernetes that are open and extensible" (&lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;That word "set" captures Flux's character well.&lt;/p&gt;

&lt;p&gt;If ArgoCD is one suite of controllers, &lt;strong&gt;Flux is a combination of several controllers split by purpose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below is Flux's detailed architecture per the latest version (v2.8) official docs.&lt;/p&gt;

&lt;p&gt;Six controllers each own their CRDs; the source-controller pulls sources and exposes them as artifacts, the kustomize/helm controllers apply them via server-side apply (SSA), and the image controllers commit new images back to Git, closing the loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh30i7zki0y8z0upv9xrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh30i7zki0y8z0upv9xrh.png" alt="Flux CD architecture v2.8" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Concretely, &lt;code&gt;source-controller&lt;/code&gt; (acquiring sources: Git, Helm, OCI, S3, etc.), &lt;code&gt;kustomize-controller&lt;/code&gt; (applying Kustomize), &lt;code&gt;helm-controller&lt;/code&gt; (Helm releases), &lt;code&gt;notification-controller&lt;/code&gt; (notifications), and the image automation controllers each own their own &lt;strong&gt;CRDs&lt;/strong&gt; (&lt;code&gt;GitRepository&lt;/code&gt;, &lt;code&gt;Kustomization&lt;/code&gt;, &lt;code&gt;HelmRelease&lt;/code&gt;, etc.) and collaborate (&lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;).&lt;/p&gt;
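
&lt;p&gt;As a rough sketch of how two of those CRDs collaborate: a &lt;code&gt;GitRepository&lt;/code&gt; (owned by source-controller) fetches the repo, and a &lt;code&gt;Kustomization&lt;/code&gt; (owned by kustomize-controller) applies one path from it. The names, URL, path, and intervals below are assumptions for illustration only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitops-config              # hypothetical source name
  namespace: flux-system
spec:
  interval: 1m                     # how often to check Git for new commits
  url: https://example.com/your-org/gitops-config.git   # placeholder repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: guestbook                  # hypothetical app name
  namespace: flux-system
spec:
  interval: 10m                    # how often to re-reconcile against the cluster
  sourceRef:
    kind: GitRepository
    name: gitops-config            # consumes the artifact produced above
  path: ./apps/guestbook
  prune: true                      # delete resources that were removed from Git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;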

&lt;p&gt;This finely divided structure allows free composition and extension and keeps the cluster footprint light, but unlike ArgoCD &lt;strong&gt;there's no officially built-in Web UI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You mostly check status with the CLI (&lt;code&gt;flux&lt;/code&gt;), and if you need a screen you attach a separate ecosystem UI or a vendor-hosted product (&lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Its maturity is neck and neck with ArgoCD. Flux reached &lt;strong&gt;CNCF Graduated on November 30, 2022&lt;/strong&gt; (&lt;a href="https://www.cncf.io/announcements/2022/11/30/flux-graduates-from-cncf-incubator/" rel="noopener noreferrer"&gt;CNCF: Flux Graduated&lt;/a&gt;), so the two tools graduated less than a week apart.&lt;/p&gt;

&lt;h3&gt;
  
  
  3-3. Rancher Fleet — GitOps for hundreds of clusters
&lt;/h3&gt;

&lt;p&gt;The third is SUSE's &lt;strong&gt;Rancher Fleet.&lt;/strong&gt; Its starting point differs from the other two.&lt;/p&gt;

&lt;p&gt;If ArgoCD and Flux start from "GitOps for one cluster" and expand toward multi-cluster, &lt;strong&gt;Fleet is designed from the start to aim at "large-scale multi-cluster."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS's guidance docs also introduce Fleet as a "GitOps-at-scale" tool "built to scale from a single cluster to thousands" (&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/eks-gitops-tools/rancher-fleet.html" rel="noopener noreferrer"&gt;AWS — Rancher Fleet&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Its operating model is tuned to that purpose. The &lt;strong&gt;Fleet Manager&lt;/strong&gt; on the management (upstream) cluster packages the contents of a repo pointed to by &lt;code&gt;GitRepo&lt;/code&gt; into a &lt;code&gt;Bundle&lt;/code&gt;, then &lt;strong&gt;fans out to many downstream clusters&lt;/strong&gt; according to group and target settings (&lt;a href="https://fleet.rancher.io/how-tos-for-users/gitrepo-targets" rel="noopener noreferrer"&gt;Fleet — Mapping to Downstream Clusters&lt;/a&gt;).&lt;/p&gt;
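
&lt;p&gt;A minimal sketch of what that fan-out declaration can look like, assuming a placeholder repo and downstream clusters that carry an &lt;code&gt;env: prod&lt;/code&gt; label (both are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: guestbook                  # hypothetical name
  namespace: fleet-default         # workspace that holds the downstream clusters
spec:
  repo: https://example.com/your-org/gitops-config.git   # placeholder repo
  branch: main
  paths:
    - apps/guestbook               # contents are packaged into a Bundle
  targets:
    - name: prod
      clusterSelector:
        matchLabels:
          env: prod                # every matching downstream cluster receives the Bundle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;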

&lt;p&gt;You usually manage it all from Rancher's Continuous Delivery screen. It's a powerful model for an MSP or a large enterprise where clusters are scattered by the dozens or hundreds across data centers, regions, and customers.&lt;/p&gt;

&lt;p&gt;That said, unlike ArgoCD and Flux, you should also note that Fleet &lt;strong&gt;is not a CNCF project but part of the SUSE Rancher ecosystem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below is Fleet's detailed architecture per the latest version (v0.15.0) official docs.&lt;/p&gt;

&lt;p&gt;The upstream gitjob and fleet-controller turn Git into a Bundle and create per-target BundleDeployments, and each downstream's fleet-agent pulls them outbound and applies them (the controller never connects to downstream first, so it works behind NAT and firewalls).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet68o9ui277rvqn6ceoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet68o9ui277rvqn6ceoi.png" alt="Rancher Fleet architecture v0.15.0" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3-4. Comparison in one table
&lt;/h3&gt;

&lt;p&gt;Lining the three up on the same axes makes the differences clear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpxwfw4u1qgwbj0yftcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpxwfw4u1qgwbj0yftcd.png" alt="GitOps controller comparison — ArgoCD vs Flux vs Fleet" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;ArgoCD&lt;/th&gt;
&lt;th&gt;Flux CD&lt;/th&gt;
&lt;th&gt;Rancher Fleet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One-line definition&lt;/td&gt;
&lt;td&gt;app-centric GitOps controller&lt;/td&gt;
&lt;td&gt;composable GitOps toolkit&lt;/td&gt;
&lt;td&gt;large-scale multi-cluster GitOps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Controller layout&lt;/td&gt;
&lt;td&gt;one suite (controller · repo · server)&lt;/td&gt;
&lt;td&gt;per-purpose controllers (GitOps Toolkit)&lt;/td&gt;
&lt;td&gt;Fleet Manager → downstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core CRDs&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Application&lt;/code&gt; · &lt;code&gt;ApplicationSet&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GitRepository&lt;/code&gt; · &lt;code&gt;Kustomization&lt;/code&gt; · &lt;code&gt;HelmRelease&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GitRepo&lt;/code&gt; · &lt;code&gt;Bundle&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web UI&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;built-in&lt;/strong&gt; (status · diff · rollback · OIDC/RBAC)&lt;/td&gt;
&lt;td&gt;none official (CLI + ecosystem UI)&lt;/td&gt;
&lt;td&gt;Rancher UI integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target scale&lt;/td&gt;
&lt;td&gt;single to multi-cluster&lt;/td&gt;
&lt;td&gt;single to multi-cluster&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;hundreds to thousands of clusters&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;CNCF &lt;strong&gt;Graduated&lt;/strong&gt; (2022-12)&lt;/td&gt;
&lt;td&gt;CNCF &lt;strong&gt;Graduated&lt;/strong&gt; (2022-11)&lt;/td&gt;
&lt;td&gt;SUSE Rancher (non-CNCF)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strength&lt;/td&gt;
&lt;td&gt;app-level visibility · UI&lt;/td&gt;
&lt;td&gt;lightweight · composability&lt;/td&gt;
&lt;td&gt;large-scale cluster fan-out&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On maturity (both CNCF Graduated) and on core behavior like Pull and self-heal, ArgoCD and Flux are effectively on par. The real fork was &lt;strong&gt;"do you work with apps through a screen (ArgoCD) vs compose controllers and work through the CLI (Flux)"&lt;/strong&gt;, and for Fleet, &lt;strong&gt;"how many clusters do you have."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3-5. So why ArgoCD?
&lt;/h3&gt;

&lt;p&gt;My choice was &lt;strong&gt;ArgoCD.&lt;/strong&gt; But this wasn't a question of "which of the three is superior" — it was a decision driven by &lt;strong&gt;my homelab's context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fladhratsljiapdz452ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fladhratsljiapdz452ew.png" alt="Why ArgoCD — a homelab-context decision" width="799" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There were three criteria.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, &lt;strong&gt;my environment has one cluster&lt;/strong&gt; (6 nodes, but a single cluster).

&lt;ul&gt;
&lt;li&gt;Fleet's strength of fanning out to hundreds of clusters is of no use to me; if anything, that management model is overkill.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Second, I had a strong desire to &lt;strong&gt;"see with my own eyes what had diverged."&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;ArgoCD's built-in UI, where you can instantly check drift and sync status, diff and rollback on a screen, was more intuitive than CLI-centric Flux, both for learning and for operating.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Third, since the point of this homelab is &lt;strong&gt;to learn a way of working I can someday move to an enterprise&lt;/strong&gt;, I needed a tool with rich material and examples and a guaranteed lifespan.

&lt;ul&gt;
&lt;li&gt;It was also a tool I'd grown familiar with from using it on projects, and one I wanted to dig into more deeply.&lt;/li&gt;
&lt;li&gt;ArgoCD is CNCF Graduated, and in CNCF's &lt;strong&gt;2025 ArgoCD end-user survey&lt;/strong&gt;, adoption was overwhelming — about 60% of respondents' clusters deploy applications with ArgoCD (&lt;a href="https://www.cncf.io/announcements/2025/07/24/cncf-end-user-survey-finds-argo-cd-as-majority-adopted-gitops-solution-for-kubernetes/" rel="noopener noreferrer"&gt;CNCF, 2025&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flux is an excellent tool too. With equal maturity (CNCF Graduated), lighter and freely composable, it may actually suit a team that wants to automate ops in a CLI- and code-centric way even better. I personally prefer operating via CLI over a UI as well, but for &lt;em&gt;my&lt;/em&gt; conditions — "I want to work with apps through a screen, and I'm practicing an enterprise move by running a single cluster for the long haul" — ArgoCD simply fit a notch better. Right now, ArgoCD runs on this cluster reconciling 79 apps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblx8cukllzcoaput503.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblx8cukllzcoaput503.png" alt="ArgoCD reconcile loop" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get statefulset,deploy -n argocd
NAME                                             READY
statefulset.apps/argocd-application-controller    1/1     # reconcile engine (pull/compare/sync)
deployment.apps/argocd-repo-server                1/1     # Git manifest cache/render
deployment.apps/argocd-server                     1/1     # API / UI
deployment.apps/argocd-applicationset-controller  1/1     # the auto-generator covered in a later part
# … dex / redis / notifications / image-updater

$ kubectl get applications -n argocd --no-headers | wc -l
79
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can confirm that &lt;code&gt;selfHeal&lt;/code&gt; is enabled for this reconcile too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get application root -n argocd -o jsonpath='{.spec.syncPolicy.automated}'
{"prune":true,"selfHeal":true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. What structure to build ArgoCD with — rendering, organization, repository, access
&lt;/h2&gt;

&lt;p&gt;Deciding on ArgoCD doesn't mean I can put part 4's CloudNativePG into Git right away. Follow the flow of GitOps:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;what and how do I write into Git → how does ArgoCD pull it → how is it applied to the cluster&lt;/em&gt;. Decisions come up one after another along that path; summarized, there are four.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In what format&lt;/strong&gt; do I write the manifests — &lt;em&gt;rendering&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;How do I &lt;strong&gt;register and manage&lt;/strong&gt; those apps in ArgoCD — &lt;em&gt;organization&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where (in what repository structure)&lt;/strong&gt; do I keep those files — &lt;em&gt;repository&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How does ArgoCD access&lt;/strong&gt; that repository — &lt;em&gt;access (auth)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only once these four are decided does the 'structure' to move CNPG into GitOps stand up. From here, axis by axis, let me look at &lt;em&gt;what it is, why it must be decided, what the candidates are, and on what basis to choose.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4-1. Manifest rendering — Kustomize vs Helm vs plain
&lt;/h3&gt;

&lt;p&gt;First, let me clarify what "manifest rendering" is.&lt;/p&gt;

&lt;p&gt;To bring anything up in Kubernetes, you ultimately need &lt;strong&gt;YAML (a manifest)&lt;/strong&gt; describing the target — a Deployment, a Service, a ConfigMap.&lt;/p&gt;

&lt;p&gt;But even for the same app, replicas and image tags differ per environment (dev/prod), and similar apps multiply into many copies. At this point, "how you keep the source written, and how you produce the &lt;strong&gt;final YAML&lt;/strong&gt; that actually goes into the cluster" — this process is what we call &lt;strong&gt;rendering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ArgoCD doesn't force this rendering into one way; it looks at the files in the repo path and decides automatically.&lt;/p&gt;

&lt;p&gt;If there's a &lt;code&gt;kustomization.yaml&lt;/code&gt; it's Kustomize, if there's a &lt;code&gt;Chart.yaml&lt;/code&gt; it's Helm, and if neither is present, the path is treated as a directory of plain YAML (&lt;a href="https://argo-cd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Argo CD Docs&lt;/a&gt;). So what we decide is "which of these three to write my manifests in." Let me look at how the three differ for the same goal (the same app at replicas 1 in dev, 3 in prod).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;plain YAML — as-is, no processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;plain is literally no processing. Put fully filled-in, complete YAML like &lt;code&gt;deployment.yaml&lt;/code&gt; and &lt;code&gt;service.yaml&lt;/code&gt; in a directory, and ArgoCD applies it to the cluster &lt;em&gt;unchanged, as-is&lt;/em&gt;: the repo-server has nothing to render, and the application-controller syncs the files exactly as written.&lt;/p&gt;

&lt;p&gt;There's no rendering step at all, so &lt;strong&gt;the text written in Git is exactly what gets applied to the cluster&lt;/strong&gt;: there's no ambiguity about what is deployed and no syntax to learn. It is "declarative object configuration (&lt;code&gt;kubectl apply -f&lt;/code&gt;)" itself (&lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management/" rel="noopener noreferrer"&gt;Kubernetes: Object Management&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The problem is &lt;strong&gt;repetition.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To maintain environments like dev and prod that differ by just a line or two, you copy the whole file (into separate directories, for example) and change only those lines, so even bumping one shared image tag means hand-editing every copy.&lt;/p&gt;

&lt;p&gt;The more targets, the faster duplication and omissions (fixing only one side) pile up. The image below shows that limit — two nearly identical files exist separately because of one &lt;code&gt;replicas&lt;/code&gt; line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnks1gxmhyyxtwwb7e71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnks1gxmhyyxtwwb7e71.png" alt="plain YAML rendering" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;
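
&lt;p&gt;In text form, the duplication looks roughly like this. The app name and image are hypothetical; the point is only that the prod copy repeats every line except &lt;code&gt;replicas&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# dev/deployment.yaml
# (prod/deployment.yaml is a full copy of this file with replicas: 3)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical app name
spec:
  replicas: 1                      # the only line that differs between dev and prod
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27        # bumping this tag means editing both copies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;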

&lt;p&gt;&lt;strong&gt;Kustomize — layering with base + overlay (no templates)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kustomize solves that repetition &lt;em&gt;without copy-paste.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes official docs define it as "a standalone tool to customize Kubernetes objects through a kustomization file," and since 1.14 it's built into kubectl, usable directly with &lt;code&gt;kubectl apply -k&lt;/code&gt; (&lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;Kubernetes — Kustomize&lt;/a&gt;). The core concepts are &lt;strong&gt;base and overlay.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keep one copy of the shared manifest in &lt;code&gt;base/&lt;/code&gt; (e.g. a Deployment with replicas 1), and write only the per-environment differences as a &lt;strong&gt;patch&lt;/strong&gt; in &lt;code&gt;overlays/prod/&lt;/code&gt;'s &lt;code&gt;kustomization.yaml&lt;/code&gt; (e.g. "replicas to 3," "prefix names with prod-"). Then &lt;code&gt;kustomize build&lt;/code&gt; reads the base and overlays the patch to produce the final YAML.&lt;/p&gt;

&lt;p&gt;The decisive trait is that &lt;strong&gt;there's no template language&lt;/strong&gt; — it's not variable substitution like &lt;code&gt;{{ }}&lt;/code&gt; but &lt;em&gt;merging YAML on top of YAML&lt;/em&gt; to make another plain YAML, so the result reads as-is and "what changed and how" is visible.&lt;/p&gt;

&lt;p&gt;Shared values (image tags, etc.) propagate to every overlay when you change just one place in base, so plain's copy-paste problem disappears. In return, it's weak at complex expressions like conditional branching or repeated generation.&lt;/p&gt;

&lt;p&gt;The image below shows, with real files, how one base + a prod overlay's patch merge into the final YAML.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg889puzqxw5ypn6g77vf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg889puzqxw5ypn6g77vf.png" alt="Kustomize rendering" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;
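
&lt;p&gt;As a minimal sketch of the same idea in files, assuming &lt;code&gt;base/deployment.yaml&lt;/code&gt; holds a Deployment named &lt;code&gt;web&lt;/code&gt; with &lt;code&gt;replicas: 1&lt;/code&gt; (the name and directory layout are assumptions): the two documents below are separate files, shown together for brevity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml                # the single shared copy (replicas: 1)
---
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                     # start from the base
namePrefix: prod-                  # "prefix names with prod-"
patches:
  - target:
      kind: Deployment
      name: web
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;kustomize build overlays/prod&lt;/code&gt; (or &lt;code&gt;kubectl apply -k overlays/prod&lt;/code&gt;) on a layout like this would render the base Deployment with the name prefix and the replica patch applied, and the output is still ordinary YAML.&lt;/p&gt;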

&lt;p&gt;&lt;strong&gt;Helm — parameterizing with charts and values (a template engine)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Helm takes a different approach.&lt;/p&gt;

&lt;p&gt;Calling itself "the package manager for Kubernetes" (&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;), it bundles an application into a package called a &lt;strong&gt;chart.&lt;/strong&gt; The manifests in the chart's &lt;code&gt;templates/&lt;/code&gt; don't write values directly but place &lt;strong&gt;Go template variable slots&lt;/strong&gt; like &lt;code&gt;{{ .Values.replicas }}&lt;/code&gt;, and the actual values are written separately in &lt;code&gt;values.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At deploy time, Helm slots the values into place (substitution) to render the final manifests.&lt;/p&gt;

&lt;p&gt;So &lt;strong&gt;keeping the chart and just swapping the values&lt;/strong&gt; lets you deploy with the same chart to dev and prod, and even other clusters.&lt;/p&gt;

&lt;p&gt;On top of this, it supports conditionals (&lt;code&gt;if&lt;/code&gt;), loops (&lt;code&gt;range&lt;/code&gt;), shared helpers (&lt;code&gt;_helpers.tpl&lt;/code&gt;), and dependencies on other charts (subcharts), giving it the strongest expressiveness.&lt;/p&gt;

&lt;p&gt;That makes it especially good for &lt;strong&gt;pulling in complex software someone else has published, chart and all, and changing only the values to fit my environment&lt;/strong&gt; (public charts usually come from official chart repositories).&lt;/p&gt;

&lt;p&gt;Because a template language sits in between, "the text written in Git" and "the YAML that will actually be applied" are one step apart. So you need the habit of expanding the render result in advance with &lt;code&gt;helm template&lt;/code&gt; to check it.&lt;/p&gt;

&lt;p&gt;The image below shows the process of variable slots (&lt;code&gt;{{ }}&lt;/code&gt;) being substituted with values into the final YAML.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01pt58ravnqnbgt4a81x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01pt58ravnqnbgt4a81x.png" alt="Helm rendering" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;
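
&lt;p&gt;A small sketch of that substitution, using a hypothetical chart laid out under &lt;code&gt;chart/&lt;/code&gt;. The &lt;code&gt;imageTag&lt;/code&gt; value and the file names are assumptions, and the required &lt;code&gt;Chart.yaml&lt;/code&gt; is omitted for brevity. The three documents are separate files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# chart/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicas }}          # slot filled from values
  selector:
    matchLabels:
      app: {{ .Release.Name }}-web
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-web
    spec:
      containers:
        - name: web
          image: "nginx:{{ .Values.imageTag }}"
---
# chart/values.yaml (defaults, used as-is for dev)
replicas: 1
imageTag: "1.27"
---
# values-prod.yaml (overrides only what differs in prod)
replicas: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a layout like this, &lt;code&gt;helm template web ./chart -f values-prod.yaml&lt;/code&gt; prints the rendered manifests without touching the cluster, which is the quickest way to check what the prod values actually produce.&lt;/p&gt;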

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;plain YAML&lt;/th&gt;
&lt;th&gt;Kustomize&lt;/th&gt;
&lt;th&gt;Helm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Processing&lt;/td&gt;
&lt;td&gt;none (apply as-is)&lt;/td&gt;
&lt;td&gt;base + overlay patch·merge&lt;/td&gt;
&lt;td&gt;Go template variable substitution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Template language&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;yes (&lt;code&gt;{{ }}&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parameterization·expressiveness&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;medium (patch·field injection)&lt;/td&gt;
&lt;td&gt;high (conditionals·loops·dependencies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output transparency&lt;/td&gt;
&lt;td&gt;highest (source = result)&lt;/td&gt;
&lt;td&gt;high (YAML→YAML)&lt;/td&gt;
&lt;td&gt;low (must render to see)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built into kubectl&lt;/td&gt;
&lt;td&gt;apply only (&lt;code&gt;-f&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;yes (&lt;code&gt;-k&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;no (separate tool)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reuse·distribution&lt;/td&gt;
&lt;td&gt;low (copy-paste)&lt;/td&gt;
&lt;td&gt;medium (base reuse)&lt;/td&gt;
&lt;td&gt;high (shared via charts·repos)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning·complexity&lt;/td&gt;
&lt;td&gt;lowest&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;td&gt;medium to high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;a few static resources&lt;/td&gt;
&lt;td&gt;manifests I declare myself&lt;/td&gt;
&lt;td&gt;complex external public charts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In short, the three aren't a matter of better or worse but of &lt;strong&gt;purpose&lt;/strong&gt; — plain for a small number of static resources, template-free and clean Kustomize for manifests I declare myself, and Helm for pulling in complex external public charts.&lt;/p&gt;

&lt;h3&gt;
  
  
  4-2. App organization — app-of-apps vs ApplicationSet
&lt;/h3&gt;

&lt;p&gt;Next is "how to register those written manifests in ArgoCD."&lt;/p&gt;

&lt;p&gt;ArgoCD handles the deploy unit as a CRD called &lt;code&gt;Application&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It's a single sheet that says "sync this Git path to this namespace of this cluster." With one or two apps, you can write these &lt;code&gt;Application&lt;/code&gt;s by hand, one at a time. But once you have dozens to bring up, &lt;em&gt;making the Applications themselves&lt;/em&gt; becomes work, and it's easy to miss some or let them drift apart.&lt;/p&gt;

&lt;p&gt;So you have to decide "how to create Applications systematically (automatically, if possible)." ArgoCD offers two paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app-of-apps&lt;/strong&gt; is, in the official docs' exact words, a pattern that "declares one ArgoCD app consisting only of other apps" (&lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/" rel="noopener noreferrer"&gt;Argo CD — Cluster Bootstrapping&lt;/a&gt;). You write a list of child &lt;code&gt;Application&lt;/code&gt;s into one parent &lt;strong&gt;root Application&lt;/strong&gt;, and syncing just that root creates the children one after another.&lt;/p&gt;

&lt;p&gt;It suits a bootstrap entry point that "stands up the cluster's skeleton in one shot." But because you &lt;em&gt;write the child list directly&lt;/em&gt;, you have to add one more child Application by hand each time a new app appears.&lt;/p&gt;
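
&lt;p&gt;One common way to lay this out, sketched under assumptions: the root &lt;code&gt;Application&lt;/code&gt; points at a directory that contains nothing but child &lt;code&gt;Application&lt;/code&gt; manifests. The repository URL and the &lt;code&gt;bootstrap/&lt;/code&gt; path are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/gitops-config.git   # placeholder repo
    targetRevision: main
    path: bootstrap                # hypothetical directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd              # child Applications are created in ArgoCD's namespace
  syncPolicy:
    automated:
      prune: true                  # removing a child manifest from Git removes the child app
      selfHeal: true               # manual edits to children are reverted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding a new app then means committing one more child manifest under that directory, which is exactly the manual step described above.&lt;/p&gt;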

&lt;p&gt;&lt;strong&gt;ApplicationSet&lt;/strong&gt; goes one step further — in the official docs' words, a controller that "automates and flexibly manages Applications across many clusters and apps" (&lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/" rel="noopener noreferrer"&gt;Argo CD — ApplicationSet&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The core is the &lt;strong&gt;generator.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A generator produces parameters, and those parameters are slotted into a single template to &lt;em&gt;stamp out&lt;/em&gt; Applications. Generators include list (giving the list directly), cluster (scanning registered clusters), &lt;strong&gt;git&lt;/strong&gt; (scanning a repo's folders and files), and matrix (multiplying two together).&lt;/p&gt;

&lt;p&gt;In particular, the &lt;strong&gt;git generator&lt;/strong&gt; automatically creates an Application for each folder under a set path (e.g. &lt;code&gt;apps/*&lt;/code&gt;), so &lt;em&gt;just adding a new folder makes the app appear on its own.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's no need for a human to create the Application directly.&lt;/p&gt;
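&lt;p&gt;To make the "parameters slotted into one template" idea concrete, here is a minimal Python sketch that imitates what a git directory generator does. It is not ArgoCD code: the folder list, the repo URL, and the function names are hypothetical, and the rendered Application is trimmed to a few fields.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Hypothetical repo layout: every folder directly under apps/ is one app.
REPO_DIRS = ["apps/blog", "apps/grafana", "platform/cert-manager"]


def git_directory_generator(dirs, pattern_parent="apps"):
    """Yield one parameter set per directory that sits directly under pattern_parent."""
    for d in sorted(dirs):
        path = PurePosixPath(d)
        if str(path.parent) == pattern_parent:
            yield {"path": str(path), "basename": path.name}


def render_application(params, repo_url="git@example.com:me/config.git"):
    """Slot one parameter set into a single (simplified) Application template."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": params["basename"], "namespace": "argocd"},
        "spec": {
            "source": {"repoURL": repo_url, "path": params["path"]},
            "destination": {"namespace": params["basename"]},
        },
    }


def stamp_out(dirs):
    """Run the generator and render one Application per parameter set."""
    return [render_application(p) for p in git_directory_generator(dirs)]


if __name__ == "__main__":
    before = [a["metadata"]["name"] for a in stamp_out(REPO_DIRS)]
    after = [a["metadata"]["name"] for a in stamp_out(REPO_DIRS + ["apps/vault"])]
    print(before)  # apps found before the new folder exists
    print(after)   # the same call after adding apps/vault
```

&lt;p&gt;The only difference between the two calls is one extra folder, which is the point: nobody edits a list of Applications, the folder scan produces it.&lt;/p&gt;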

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6kah5rkolt41vj82cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6kah5rkolt41vj82cg.png" alt="App organization — app-of-apps vs ApplicationSet" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two aren't so much competitors as &lt;strong&gt;different layers.&lt;/strong&gt; app-of-apps is good at making "one initial entry point," ApplicationSet at "mass-producing apps beneath it."&lt;/p&gt;

&lt;p&gt;So they're commonly used together — root (app-of-apps) stands up the ApplicationSets, and each ApplicationSet scans folders to stamp out the actual apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  4-3. Repository structure — monorepo vs polyrepo
&lt;/h3&gt;

&lt;p&gt;Third is "in &lt;em&gt;which repository&lt;/em&gt; to keep those declarations."&lt;/p&gt;

&lt;p&gt;There's an order to this: first there's a &lt;strong&gt;principle&lt;/strong&gt; to settle, and only then do you decide &lt;em&gt;whether to keep that repository as one or split it into many.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The principle is to &lt;strong&gt;separate config (manifests) from app source code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The official ArgoCD guide is explicit here: it &lt;em&gt;strongly recommends that Kubernetes manifests live in a Git repository separate from the application source code&lt;/em&gt; (&lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/" rel="noopener noreferrer"&gt;Argo CD — Best Practices&lt;/a&gt;). The reasons follow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;① If source and config are in one repo, a config-only change re-runs the app's build CI, and if that CI itself commits manifest changes, the result is easily an infinite loop.&lt;/li&gt;
&lt;li&gt;② Deploy history (config commits) and development history (source commits) get tangled, making the audit log messy.&lt;/li&gt;
&lt;li&gt;③ It's hard to separate the permissions of "people who touch the code" and "people who deploy to production."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What remains is how to keep that "config repository."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;monorepo&lt;/strong&gt; gathers the entire cluster's declarations in one repo, separated by folders (&lt;code&gt;platform/&lt;/code&gt;, &lt;code&gt;workloads/&lt;/code&gt;, …)

&lt;ul&gt;
&lt;li&gt;The whole picture of changes fits in one view, and the ApplicationSet's git generator only needs to scan one repo, keeping things simple&lt;/li&gt;
&lt;li&gt;But once an organization gets very large, it becomes hard to divide permissions finely, like "this folder for this team only."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;polyrepo&lt;/strong&gt; splits config repos per team or domain

&lt;ul&gt;
&lt;li&gt;You can cleanly divide access permissions per repo, but it gets cumbersome to see the whole cluster at once or to make changes spanning multiple repos. (Neither option is the right answer; it's a trade-off driven by org size and permission needs.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwt1fi654wier7l5e127.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwt1fi654wier7l5e127.png" alt="Repository structure — separate source · monorepo vs polyrepo" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4-4. Repository access — HTTPS vs SSH deploy key vs GitHub App
&lt;/h3&gt;

&lt;p&gt;The last is &lt;em&gt;how ArgoCD reads that (usually private) repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Per the Pull model from section 2, ArgoCD's &lt;code&gt;repo-server&lt;/code&gt;, the party doing the reading, is inside the cluster and uses only outbound connections. Still, reading a private repo needs credentials, and each method differs in &lt;em&gt;the scope its permission reaches&lt;/em&gt; and &lt;em&gt;whether it can be narrowed to read-only.&lt;/em&gt; According to the official docs, ArgoCD supports HTTPS (username/token), SSH private key (deploy key), GitHub App, TLS client certificates, and more (&lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/" rel="noopener noreferrer"&gt;Argo CD — Private Repositories&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTPS · Personal Access Token&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You authenticate over HTTPS, using the token in place of a password.&lt;/li&gt;
&lt;li&gt;It's the simplest method, but the token is tied to the account, so it easily reaches many repos and usually carries read and write permission together.&lt;/li&gt;
&lt;li&gt;If leaked, the blast radius is large.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSH · Deploy Key&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;GitHub's official docs define a deploy key as &lt;em&gt;"an SSH key that grants access to a single repository,"&lt;/em&gt; and specify that &lt;em&gt;"it's read-only by default, and write access can be granted when adding it"&lt;/em&gt; (&lt;a href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/managing-deploy-keys" rel="noopener noreferrer"&gt;GitHub — Deploy keys&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;That is, &lt;em&gt;its scope is limited to that one repository&lt;/em&gt; and it &lt;em&gt;can be issued read-only&lt;/em&gt;, fitting the principle of least privilege best.&lt;/li&gt;
&lt;li&gt;You register the public key as the repository's deploy key and put the private key into ArgoCD.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub App&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A fine-grained method that reaches &lt;em&gt;only chosen repos and permissions&lt;/em&gt; per installation.&lt;/li&gt;
&lt;li&gt;The installation token is a short-lived token that expires in about an hour, so you use it with auto-renewal (&lt;a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/authenticating-as-a-github-app-installation" rel="noopener noreferrer"&gt;GitHub — App installation auth&lt;/a&gt;), and auditing and revocation stay clean.&lt;/li&gt;
&lt;li&gt;Suits org and many-repo scale, but the initial setup is somewhat complex.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkbhbqzak18f3y31w71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkbhbqzak18f3y31w71.png" alt="Repository access — HTTPS · SSH deploy key · GitHub App" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whichever method you choose, the credential stays inside the cluster and ArgoCD makes only outbound connections. What differs is "how far you can narrow the permission": for a single repository, the deploy key, &lt;em&gt;limited to that repo and read-only&lt;/em&gt;, is the simplest and safest.&lt;/p&gt;

&lt;h3&gt;
  
  
  4-5. What does my homelab's setup look like? — what, why, and so what do I gain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qy8vjbn88aux63782ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qy8vjbn88aux63782ct.png" alt="My setup's character and how it scales — simple·safe·reproducible, axis-by-axis" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rendering&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kustomize by default + Helm alongside.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The resources I declare myself have almost no per-environment branching, so Kustomize, where &lt;em&gt;the result YAML is visible as-is&lt;/em&gt; with no template language, was simple and easy to debug.&lt;/li&gt;
&lt;li&gt;Conversely, charts &lt;em&gt;someone else made well and published&lt;/em&gt; — like operators — I don't bother unpacking and porting; I pull them in as-is with Helm.&lt;/li&gt;
&lt;li&gt;What I use myself stays transparent, what others made gets reused, and the management burden is minimized on both sides.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organization&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;app-of-apps + ApplicationSet.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;I keep root (app-of-apps) as the bootstrap entry point, and beneath it the ApplicationSets scan folders with a git generator to mass-produce apps.&lt;/li&gt;
&lt;li&gt;Adding an app needs no hand-made Application (just add a folder), and rebuilding the cluster restores everything from the single root.&lt;/li&gt;
&lt;li&gt;Reproducibility and scalability come together.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A config monorepo separated from source.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Manifests are separated from app source (the official recommendation), but on a single-cluster homelab I gathered that config into one repo.&lt;/li&gt;
&lt;li&gt;Separating config from source avoids CI loops, tangled history, and permission problems, while the single repo lets me see all changes in one view and keeps the generator simple.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A read-only SSH deploy key.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;On top of the Pull model that keeps credentials only inside the cluster, I narrowed the permission to &lt;em&gt;that one repository, read-only.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Even if the key leaks, it can't do more than read that repo, so the blast radius is structurally bounded.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup's &lt;strong&gt;character&lt;/strong&gt;, in one line, is &lt;em&gt;"single cluster, one config monorepo, read-only pull"&lt;/em&gt; — simple, safe, and reproducible as a whole.&lt;/p&gt;

&lt;p&gt;At the same time, it's a foundation that scales to an enterprise by changing just one axis at a time (repository to polyrepo, access to GitHub App, target to multi-cluster).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ArgoCD repo connection secret — the key layout alone tells you the method.
$ kubectl get secret repo-seonology-k3s -n argocd -o jsonpath='{.data}' | jq 'keys'
[
  "sshPrivateKey",   # connect via SSH deploy key — credentials stay in the cluster
  "type",            # git
  "url"              # exactly one repo = monorepo
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
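&lt;p&gt;The three keys in the output above line up with a declarative ArgoCD repository Secret. As a rough sketch, the Python below builds such a manifest and prints it as JSON. The Secret name, repo URL, and key value are placeholders I made up for illustration; the &lt;code&gt;argocd.argoproj.io/secret-type: repository&lt;/code&gt; label is what marks the Secret as a repository for ArgoCD.&lt;/p&gt;

```python
import json


def repo_secret(name, url, ssh_private_key, namespace="argocd"):
    """Build a repository Secret manifest for an SSH deploy key."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # This label is how ArgoCD recognizes the Secret as a repository.
            "labels": {"argocd.argoproj.io/secret-type": "repository"},
        },
        # stringData holds plain text; Kubernetes stores it base64-encoded under data.
        "stringData": {
            "type": "git",
            "url": url,
            "sshPrivateKey": ssh_private_key,
        },
    }


if __name__ == "__main__":
    secret = repo_secret(
        name="repo-example-config",                # hypothetical Secret name
        url="git@github.com:example/config.git",   # hypothetical repository
        ssh_private_key="PLACEHOLDER-PRIVATE-KEY",  # never commit a real key
    )
    print(json.dumps(secret, indent=2))
    print(sorted(secret["stringData"]))
```

&lt;p&gt;Because &lt;code&gt;kubectl apply -f -&lt;/code&gt; accepts JSON as well as YAML, the printed manifest could be piped straight in, though a real private key should come from a secret store rather than from source code.&lt;/p&gt;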



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdgw4mvvuperviooehh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdgw4mvvuperviooehh5.png" alt="My GitOps setup — from monorepo to cluster" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These decisions together finish the &lt;em&gt;preparation&lt;/em&gt; to move part 4's imperatively-built CloudNativePG into GitOps. In the next part (part 6), I'll actually build this design by hand.&lt;/p&gt;

&lt;p&gt;The steps, in order: &lt;strong&gt;creating the config repository → installing ArgoCD → connecting the repository with a read-only deploy key → standing up the root (app-of-apps) and ApplicationSet skeleton.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And in part 7, I'll lay CNPG on top of it — declaring the operator with Helm and the &lt;code&gt;Cluster&lt;/code&gt; CR with Kustomize — and take it all the way to running it as GitOps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even after deciding to use ArgoCD, four more things need deciding — rendering (Kustomize by default + Helm alongside), organization (app-of-apps for bootstrap + ApplicationSet for mass production), repository (a config monorepo separated from source), access (a read-only SSH deploy key). The reason for each choice converges into one — "simple, safe, reproducible" — and at the same time becomes a foundation that scales to an enterprise by changing just one axis at a time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Wrapping up — and what's next
&lt;/h2&gt;

&lt;p&gt;This part didn't run a single command (no &lt;code&gt;kubectl apply&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Instead, it finished the &lt;strong&gt;design&lt;/strong&gt; that lets me retire that command. I confirmed that GitOps's four principles address exactly the four things that broke under hand-run ops (drift, history, reproducibility, audit) (sections 1 and 2), chose ArgoCD as the tool to run that Pull model (section 3), and decided, along four axes, &lt;em&gt;how to write, group, keep, and read&lt;/em&gt; the manifests on top of it (section 4).&lt;/p&gt;

&lt;p&gt;Spread out, it looks like a lot of decisions, but they all converge in one direction — &lt;strong&gt;simple, safe, and reproducible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rendering keeps Kustomize, whose result is visible as-is, by default while pulling in complex charts others made with Helm; organization uses app-of-apps + ApplicationSet, where everything is restored from a single root; the repository is a config monorepo separated from source; access is a read-only deploy key that reads only that one repo.&lt;/p&gt;

&lt;p&gt;It's a plan that's both the simplest, safest starting point for running a single cluster, and one you can scale to an enterprise by changing just one axis at a time.&lt;/p&gt;

&lt;p&gt;With the design done, from the next part on I build this by hand.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 6 · Bootstrap&lt;/strong&gt; — install ArgoCD, connect the config repository with a read-only deploy key, then stand up the cluster's skeleton with root (app-of-apps) and ApplicationSet. And I'll see with my own eyes how ArgoCD reverts a change someone made directly with &lt;code&gt;kubectl edit&lt;/code&gt; (self-heal).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 7 · Apply&lt;/strong&gt; — move the CloudNativePG I brought up imperatively in part 4 into GitOps. I'll declare the operator with Helm and the &lt;code&gt;Cluster&lt;/code&gt; CR with Kustomize, and finish off the last remaining homework, secret (password) management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's time to make a single commit take the place where I used to type &lt;code&gt;kubectl apply&lt;/code&gt; by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  References / Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps definition and principles&lt;/strong&gt; — &lt;a href="https://opengitops.dev/" rel="noopener noreferrer"&gt;OpenGitOps (CNCF)&lt;/a&gt; · &lt;a href="https://about.gitlab.com/topics/gitops/" rel="noopener noreferrer"&gt;GitLab — What is GitOps&lt;/a&gt; · &lt;a href="https://www.gitops.tech/" rel="noopener noreferrer"&gt;gitops.tech&lt;/a&gt; · &lt;a href="https://www.cncf.io/blog/2025/06/09/gitops-in-2025-from-old-school-updates-to-the-modern-way/" rel="noopener noreferrer"&gt;CNCF — GitOps in 2025&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes basics&lt;/strong&gt; — &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management/" rel="noopener noreferrer"&gt;Object Management&lt;/a&gt; · &lt;a href="https://kubernetes.io/docs/concepts/architecture/controller/" rel="noopener noreferrer"&gt;Controllers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; — &lt;a href="https://argo-cd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Official docs&lt;/a&gt; · &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/" rel="noopener noreferrer"&gt;Cluster Bootstrapping (app-of-apps)&lt;/a&gt; · &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/" rel="noopener noreferrer"&gt;ApplicationSet&lt;/a&gt; · &lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/" rel="noopener noreferrer"&gt;Best Practices&lt;/a&gt; · &lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/" rel="noopener noreferrer"&gt;Private Repositories&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flux CD&lt;/strong&gt; — &lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;Official site&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rancher Fleet&lt;/strong&gt; — &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/eks-gitops-tools/rancher-fleet.html" rel="noopener noreferrer"&gt;AWS Prescriptive Guidance&lt;/a&gt; · &lt;a href="https://fleet.rancher.io/how-tos-for-users/gitrepo-targets" rel="noopener noreferrer"&gt;Fleet — GitRepo Targets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rendering tools&lt;/strong&gt; — &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;Kubernetes — Kustomize&lt;/a&gt; · &lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance and adoption&lt;/strong&gt; — &lt;a href="https://www.cncf.io/announcements/2022/12/06/the-cloud-native-computing-foundation-announces-argo-has-graduated/" rel="noopener noreferrer"&gt;Argo CNCF Graduated (2022)&lt;/a&gt; · &lt;a href="https://www.cncf.io/announcements/2022/11/30/flux-graduates-from-cncf-incubator/" rel="noopener noreferrer"&gt;Flux CNCF Graduated (2022)&lt;/a&gt; · &lt;a href="https://www.cncf.io/announcements/2025/07/24/cncf-end-user-survey-finds-argo-cd-as-majority-adopted-gitops-solution-for-kubernetes/" rel="noopener noreferrer"&gt;CNCF — ArgoCD end-user survey (2025)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository access&lt;/strong&gt; — &lt;a href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/managing-deploy-keys" rel="noopener noreferrer"&gt;GitHub — Deploy keys&lt;/a&gt; · &lt;a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/authenticating-as-a-github-app-installation" rel="noopener noreferrer"&gt;GitHub — App installation auth&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>k3s</category>
      <category>gitops</category>
      <category>argocd</category>
    </item>
    <item>
      <title>Hybrid k3s #4: Building a unified database on k3s — five Postgres operators, and CloudNativePG</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Tue, 09 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/seon/hybrid-k3s-4-building-a-unified-database-on-k3s-five-postgres-operators-and-cloudnativepg-2egl</link>
      <guid>https://dev.to/seon/hybrid-k3s-4-building-a-unified-database-on-k3s-five-postgres-operators-and-cloudnativepg-2egl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-4%2Fk3s-4-master-en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-4%2Fk3s-4-master-en.png" alt="Hybrid k3s — full architecture" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  0. About this series
&lt;/h2&gt;

&lt;p&gt;This series is a record — written one piece at a time — of how I actually built the homelab in the diagram above, the one that's still running as I write this.&lt;/p&gt;

&lt;p&gt;What began as a toy project, from a simple "could this even work?", turned, through satisfying performance and endless tearing down and rebuilding, into a genuine pastime that takes the edge off the stress built up at work. It isn't a resource-rich cluster, but it's been more than enough to get a real taste of Kubernetes, and it keeps handing me the next thing I want to try.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 nodes&lt;/strong&gt; — 2 Lightsail &lt;strong&gt;servers&lt;/strong&gt; (control plane + etcd) in the cloud (AWS Tokyo) + 4 &lt;strong&gt;Lima VM agents&lt;/strong&gt; on a home (Sapporo) iMac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 vCPU / 61 GiB&lt;/strong&gt; total, &lt;strong&gt;49 namespaces&lt;/strong&gt;, &lt;strong&gt;248 pods (150 running)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed with &lt;strong&gt;ArgoCD&lt;/strong&gt;, auth via &lt;strong&gt;Keycloak OIDC&lt;/strong&gt;, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana and more running on top&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This time, on top of the six-node hybrid cluster I'd built up through part 3, I &lt;strong&gt;dissect five Operators for running PostgreSQL reliably, and end up building an HA cluster — and its backups — with CloudNativePG.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Kubernetes, databases, and the Operator
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Kubernetes supports stateful workloads; I do not." — Kelsey Hightower (2018)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's &lt;a href="https://twitter.com/kelseyhightower/status/969611213079379968" rel="noopener noreferrer"&gt;a one-liner left on X (Twitter) in 2018&lt;/a&gt; by Kelsey Hightower, widely known as a Kubernetes evangelist. The implication: "I wouldn't put a database on it myself." And it wasn't just his opinion: putting databases on Kubernetes was long frowned upon across the infrastructure industry.&lt;/p&gt;

&lt;p&gt;The reasoning is clear. A Pod can go down at any moment (it's ephemeral), and nodes get swapped out without warning. A stateless app can simply be brought back up if it falls over, but a DB that holds data can have its fate decided the moment a single Pod disappears. So "leave the DB to a managed service like RDS or Cloud SQL" was the accepted wisdom for a long time.&lt;/p&gt;

&lt;p&gt;Two things overturned that wisdom. The first was the maturing of &lt;strong&gt;StatefulSet and PersistentVolume&lt;/strong&gt;. Even if a Pod restarts or moves to another node, it can keep the same volume and a stable network ID — which laid the groundwork for stateful workloads. The second was the &lt;strong&gt;Operator pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Operator is &lt;a href="https://www.redhat.com/en/blog/introducing-operators-putting-operational-knowledge-into-software" rel="noopener noreferrer"&gt;a concept CoreOS introduced in 2016&lt;/a&gt; — in a phrase, "putting operational knowledge into software." It takes the operational work that used to live in an admin's head or in shell scripts — provisioning, version upgrades, failover, backups, point-in-time recovery (PITR) — and moves it into the code of a &lt;strong&gt;controller&lt;/strong&gt; that runs right alongside the workload. You declare only the "desired state" in YAML, and the Operator continuously reconciles it against the current state, converging the two. The first examples were CoreOS's etcd Operator and Prometheus Operator.&lt;/p&gt;
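&lt;p&gt;The "declare the desired state, let the controller converge" loop can be sketched in a few lines of Python. This is a toy model, not any real Operator's code: &lt;code&gt;ClusterState&lt;/code&gt;, &lt;code&gt;reconcile&lt;/code&gt;, and &lt;code&gt;converge&lt;/code&gt; are names I invented, and a real controller would call the Kubernetes API instead of mutating an object in memory.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Toy model of a database cluster: just an instance count and a version."""
    instances: int
    version: str


def reconcile(desired, current):
    """Take one step toward the desired state and return the action taken."""
    if current.version != desired.version:
        current.version = desired.version
        return "upgrade"
    if desired.instances > current.instances:
        current.instances += 1
        return "add-instance"
    if current.instances > desired.instances:
        current.instances -= 1
        return "remove-instance"
    return "noop"


def converge(desired, current, max_steps=20):
    """Call reconcile repeatedly until nothing is left to do."""
    actions = []
    for _ in range(max_steps):
        action = reconcile(desired, current)
        if action == "noop":
            break
        actions.append(action)
    return actions


if __name__ == "__main__":
    desired = ClusterState(instances=3, version="17")
    current = ClusterState(instances=1, version="16")
    print(converge(desired, current))  # the steps the loop decided on
    print(current == desired)          # whether the two states now match
```

&lt;p&gt;The user only ever writes &lt;code&gt;desired&lt;/code&gt;; deciding which step comes next is the operational knowledge that now lives in code.&lt;/p&gt;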

&lt;p&gt;The harder a piece of software is to operate — and a DB is exactly that — the more this pattern pays off. And the PostgreSQL ecosystem is where Operators compete most fiercely. In the next chapter I compare these solutions, each with its own architectural philosophy, one at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Comparing five major Postgres Operator architectures
&lt;/h2&gt;

&lt;p&gt;Below are the major PostgreSQL Operators most widely used today across the CNCF ecosystem and enterprise environments. Each has its own architecture and trade-offs, and the right pick shifts with your infrastructure's requirements.&lt;/p&gt;

&lt;p&gt;One caveat: my reason for picking one of these five leans heavily toward "what fits my homelab" and "what I personally wanted more hands-on experience with." Please read it knowing that's a different lens from evaluating them for production use at a company.&lt;/p&gt;

&lt;h3&gt;
  
  
  ① Zalando Postgres Operator
&lt;/h3&gt;

&lt;p&gt;The first one I looked at was the elder statesman of this space, &lt;a href="https://github.com/zalando/postgres-operator" rel="noopener noreferrer"&gt;Zalando Postgres Operator&lt;/a&gt;. Built by the German e-commerce company Zalando for running its own PostgreSQL, and hardened over years of running hundreds of clusters in-house, it's among the oldest Operators around.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Postgres Operator delivers an easy to run highly-available PostgreSQL clusters on Kubernetes (K8s) powered by Patroni. — Zalando postgres-operator official README&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At its core is a Docker image called &lt;strong&gt;Spilo&lt;/strong&gt;. Spilo bundles PostgreSQL, the HA manager &lt;strong&gt;Patroni&lt;/strong&gt;, and the S3 backup/restore tool &lt;strong&gt;WAL-G&lt;/strong&gt; into a single image; the Operator itself sits on top as a relatively thin control layer that "stands up Spilo Pods once you declare the cluster you want via a CRD." The actual high availability (leader election and automatic failover) is handled by the Patroni inside each Pod, and if your application connections need pooling, you can stand up &lt;strong&gt;PgBouncer separately as a connection pooler&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me clear up a common misconception here: that "using Patroni means you need a separate external consensus store (DCS) like etcd or ZooKeeper." &lt;strong&gt;On Kubernetes, that's not the case.&lt;/strong&gt; In Zalando Operator's defaults, Patroni uses &lt;strong&gt;Kubernetes resources themselves (Endpoints, or ConfigMaps) as the DCS&lt;/strong&gt;, and by default the external etcd connection is simply left unset. Patroni takes a leader lock with a TTL (30s by default) on that K8s object and refreshes it periodically; if the leader vanishes, the remaining nodes compare WAL positions and elect a new one.&lt;/p&gt;
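&lt;p&gt;A heavily simplified Python sketch of that TTL leader lock follows. It is not Patroni's implementation: the real thing is a race with compare-and-set updates on the Kubernetes object, whereas this model just picks the candidate with the highest WAL position. The class, the member names, and the WAL numbers are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass

TTL_SECONDS = 30  # the default TTL quoted in the text


@dataclass
class LeaderLock:
    """Stand-in for the Kubernetes object (Endpoints or ConfigMap) that holds the lock."""
    holder: str = ""
    expires_at: float = 0.0

    def expired(self, now):
        return now >= self.expires_at

    def renew(self, member, now):
        """The current holder extends the lock; anyone else is refused."""
        if self.holder == member and not self.expired(now):
            self.expires_at = now + TTL_SECONDS
            return True
        return False

    def acquire(self, member, now):
        """A member can take the lock only once it is free or expired."""
        if self.holder == "" or self.expired(now):
            self.holder = member
            self.expires_at = now + TTL_SECONDS
            return True
        return False


def elect(lock, wal_positions, now):
    """After expiry, the replica with the highest WAL position takes the lock."""
    if not lock.expired(now):
        return lock.holder
    candidate = max(wal_positions, key=wal_positions.get)
    lock.acquire(candidate, now)
    return lock.holder


if __name__ == "__main__":
    lock = LeaderLock()
    lock.acquire("pg-0", now=0)
    print(lock.renew("pg-0", now=10))    # leader refreshes in time
    print(lock.acquire("pg-1", now=20))  # lock still held, so this is refused
    # pg-0 vanishes and stops renewing; its lock runs out at t=40.
    print(elect(lock, {"pg-1": 5000, "pg-2": 5200}, now=41))
```

&lt;p&gt;As long as the leader keeps renewing, nobody else can take the lock; once it stops, the TTL is the longest the cluster waits before a replica can step in.&lt;/p&gt;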

&lt;p&gt;Its strength is the sheer weight of precedent and information. Because it's the oldest and most battle-tested in large-scale production, when you hit a problem a search usually turns up a prior case. The license is the permissive MIT, too. That said, this Operator was essentially built for Zalando's own needs, so there's no official commercial support; maintenance continues as of 2026, but the release cadence has visibly slowed compared to the newer CloudNativePG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i9bd3xhfdi07dg5uf4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i9bd3xhfdi07dg5uf4c.png" alt="Zalando Postgres Operator — Architecture" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ② CrunchyData PGO
&lt;/h3&gt;

&lt;p&gt;Next is &lt;a href="https://github.com/CrunchyData/postgres-operator" rel="noopener noreferrer"&gt;CrunchyData PGO&lt;/a&gt;. As its GitHub repo describes it — &lt;em&gt;"Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service"&lt;/em&gt; — it was built by a database-focused company, and it shows: it's the Operator that &lt;strong&gt;puts the most weight on "data protection and backup."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HA itself is the same lineage as Zalando. Inside each Postgres Pod's &lt;code&gt;database&lt;/code&gt; container, &lt;strong&gt;PostgreSQL and Patroni&lt;/strong&gt; run together to handle automatic failover, and the consensus store (DCS) is the &lt;strong&gt;Kubernetes API (Endpoints lease)&lt;/strong&gt; — no external etcd here either. Applications connect to the Primary and Replicas through &lt;strong&gt;PgBouncer&lt;/strong&gt; (a connection pooler) and &lt;strong&gt;Services (rw/ro)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;PGO's real strength is its backups. It integrates &lt;strong&gt;pgBackRest&lt;/strong&gt;, the de facto standard PostgreSQL backup tool, as a &lt;strong&gt;sidecar container&lt;/strong&gt; on each Postgres Pod plus a &lt;strong&gt;dedicated repo host Pod&lt;/strong&gt;. With just &lt;code&gt;spec.backups.pgbackrest&lt;/code&gt; configuration, it archives all transaction logs (WAL) to &lt;strong&gt;up to four repositories&lt;/strong&gt; (backed by storage such as S3, MinIO, GCS, or Azure Blob), so even if a whole node is lost, point-in-time recovery (PITR) and disaster recovery (DR) remain possible. If you take backup and recovery seriously, it's the most reassuring choice.&lt;/p&gt;

&lt;p&gt;But getting it into the homelab had a &lt;strong&gt;licensing catch&lt;/strong&gt;. PGO's source code is Apache 2.0, but the &lt;strong&gt;production container images are bound by the Crunchy Data Developer Program terms&lt;/strong&gt;, and using those images in production effectively requires a commercial agreement. It's fine for personal learning or a homelab, but measured against "can I extend this to in-house or commercial use whenever I want," it was an uneasy constraint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cc1pd8ugm07dvflux71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cc1pd8ugm07dvflux71.png" alt="CrunchyData PGO — Architecture" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ③ Percona Operator for PostgreSQL
&lt;/h3&gt;

&lt;p&gt;The third is &lt;a href="https://github.com/percona/percona-postgresql-operator" rel="noopener noreferrer"&gt;Percona Operator for PostgreSQL&lt;/a&gt;. Built by Percona, which has run an open-source DB business for over 18 years, it's the choice where "fully open source" comes through most clearly.&lt;/p&gt;

&lt;p&gt;Its architecture is rooted in the CrunchyData PGO we just saw. Percona &lt;strong&gt;hard-forked&lt;/strong&gt; PGO (becoming a fully independent project from 3.0.0 onward) and grew it from there. So high availability is again handled by &lt;strong&gt;Patroni&lt;/strong&gt;, the consensus store by the &lt;strong&gt;Kubernetes API (Endpoints lease)&lt;/strong&gt;, backups by &lt;strong&gt;pgBackRest&lt;/strong&gt;, and connection pooling by &lt;strong&gt;PgBouncer&lt;/strong&gt;, inheriting PGO's proven skeleton as-is.&lt;/p&gt;

&lt;p&gt;Two things set Percona apart. One is &lt;strong&gt;PMM (Percona Monitoring and Management) integration&lt;/strong&gt;. A &lt;strong&gt;PMM Client sidecar&lt;/strong&gt; attaches to each Postgres Pod and ships query analytics (QAN), system metrics, and even &lt;strong&gt;Patroni's metrics&lt;/strong&gt; to the PMM Server. Production-grade observability comes along without much extra setup.&lt;/p&gt;

&lt;p&gt;The other was the clincher: &lt;strong&gt;the container images are fully open source (Apache 2.0), with no usage restrictions.&lt;/strong&gt; That's the exact opposite of CrunchyData requiring a commercial agreement for production images. Percona itself markets this point as &lt;em&gt;"Migrate to Freedom."&lt;/em&gt; Thinking about starting in a homelab and possibly extending to in-house or commercial use someday, that "no restrictions" was a big draw.&lt;/p&gt;

&lt;p&gt;The single reason it still fell out of the final cut was &lt;strong&gt;weight&lt;/strong&gt;. Because it carries PGO's lineage, each Pod gets several containers (database, pgBackRest, PMM), and you also have to stand up a PMM Server separately. It's a reasonable setup for an enterprise, but for my homelab splitting 19 vCPU, it was a touch heavy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faskxatnzqp76x87cufkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faskxatnzqp76x87cufkm.png" alt="Percona Operator for PostgreSQL — Architecture" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ④ StackGres
&lt;/h3&gt;

&lt;p&gt;The fourth is &lt;a href="https://gitlab.com/ongresinc/stackgres" rel="noopener noreferrer"&gt;StackGres&lt;/a&gt;. Built by Spain's OnGres, this Operator reaches beyond a mere HA tool, billing itself as &lt;strong&gt;"a complete PostgreSQL platform (DBaaS) on Kubernetes."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The HA foundation is again &lt;strong&gt;Patroni&lt;/strong&gt;, and the consensus store is the &lt;strong&gt;Kubernetes API&lt;/strong&gt; (no external etcd), the same as the previous three. What sets StackGres apart is its &lt;strong&gt;"pack everything into one Pod (batteries-included)"&lt;/strong&gt; design. Alongside PostgreSQL and Patroni, a single Postgres Pod runs several more containers: an &lt;strong&gt;Envoy proxy&lt;/strong&gt; (mandatory) that handles all traffic, the &lt;strong&gt;PgBouncer&lt;/strong&gt; connection pooler, a &lt;strong&gt;postgres-exporter&lt;/strong&gt; for metrics, &lt;strong&gt;fluent-bit&lt;/strong&gt; for logs, and a &lt;strong&gt;cluster-controller&lt;/strong&gt; that reconciles local state. The Envoy here isn't just a proxy; it parses the Postgres wire protocol and even produces connection statistics.&lt;/p&gt;

&lt;p&gt;The operational experience is well thought out, too. A &lt;strong&gt;Web Console and REST API&lt;/strong&gt; are built in by default, so nearly everything you'd do with &lt;code&gt;kubectl&lt;/code&gt; can be handled from a UI instead. Backups go through the &lt;code&gt;SGBackup&lt;/code&gt; and &lt;code&gt;SGObjectStorage&lt;/code&gt; CRDs, with continuous archiving (base backup + WAL) to S3, MinIO, GCS, or Azure. True to "batteries included," it's the friendliest all-in-one for someone just getting started.&lt;/p&gt;

&lt;p&gt;The problem was the &lt;strong&gt;license&lt;/strong&gt;. StackGres's core code is &lt;strong&gt;AGPL 3.0&lt;/strong&gt;. Using it as-is is fine, but the moment you put something on top and offer it as a service, the source-disclosure (copyleft) obligation can reach your own code. It isn't a problem right now, but thinking about the homelab-expansion scenario of "I might put my own service on this cluster and expose it externally," AGPL was a concern I'd rather avoid up front.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyt6yk0y31o24a6f8hlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyt6yk0y31o24a6f8hlv.png" alt="StackGres — Architecture" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ⑤ CloudNativePG (CNPG)
&lt;/h3&gt;

&lt;p&gt;Last is &lt;a href="https://github.com/cloudnative-pg/cloudnative-pg" rel="noopener noreferrer"&gt;CloudNativePG&lt;/a&gt; (CNPG). It is the latest arrival, having appeared only in 2022, yet in just two years it overtook Zalando and CrunchyData in GitHub stars to become &lt;strong&gt;the most popular PostgreSQL Operator today.&lt;/strong&gt; It was built by EDB (EnterpriseDB), donated to the &lt;strong&gt;CNCF Sandbox&lt;/strong&gt; in January 2025, and is licensed under Apache 2.0.&lt;/p&gt;

&lt;p&gt;The secret to CNPG vaulting to the front was, paradoxically, &lt;strong&gt;"subtraction."&lt;/strong&gt; It &lt;strong&gt;strips out Patroni entirely&lt;/strong&gt; — the thing the previous four shared — &lt;strong&gt;and uses no external DCS.&lt;/strong&gt; Instead, each Pod's &lt;strong&gt;Instance Manager&lt;/strong&gt; (a Go process running as PID 1) directly controls the PostgreSQL native binary, and the &lt;strong&gt;Kubernetes API (Endpoint Leases)&lt;/strong&gt; is the single source of truth for state. Deciding the leader (primary) and failover is handled directly by the &lt;strong&gt;Operator (controller-manager)&lt;/strong&gt; through reconciliation.&lt;/p&gt;

&lt;p&gt;The result is &lt;strong&gt;extreme simplicity.&lt;/strong&gt; Inside one Pod there's no Patroni, no Envoy, no pile of sidecars: just the Instance Manager and PostgreSQL. That means low overhead and easy debugging. Backups use the built-in &lt;strong&gt;Barman Cloud&lt;/strong&gt;, continuously archiving WAL to S3-compatible storage for PITR, and connections are cleanly split across &lt;code&gt;rw&lt;/code&gt;, &lt;code&gt;ro&lt;/code&gt;, and &lt;code&gt;r&lt;/code&gt; Services.&lt;/p&gt;

&lt;p&gt;Governance is reassuring, too. &lt;strong&gt;Because it is a CNCF project, no single company can quietly shut it down&lt;/strong&gt;, multiple vendors offer commercial support, and its development is the most active in this space. Below I line up the five in one table, and the next chapter lays out why I ended up with CNPG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo87ztr6fzn7i8u6s9nrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo87ztr6fzn7i8u6s9nrz.png" alt="CloudNativePG — Architecture" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A quick summary table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;HA engine&lt;/th&gt;
&lt;th&gt;Consensus store (DCS)&lt;/th&gt;
&lt;th&gt;Pod composition&lt;/th&gt;
&lt;th&gt;Backup&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Governance · activity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;① &lt;strong&gt;Zalando&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Patroni&lt;/td&gt;
&lt;td&gt;K8s API (Endpoints/ConfigMaps)&lt;/td&gt;
&lt;td&gt;Spilo (PG+Patroni+WAL-G), separate Pooler&lt;/td&gt;
&lt;td&gt;WAL-G → S3&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;in-house · slowing releases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;② &lt;strong&gt;CrunchyData&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Patroni&lt;/td&gt;
&lt;td&gt;K8s API (Endpoints lease)&lt;/td&gt;
&lt;td&gt;database + pgBackRest sidecar + repo host&lt;/td&gt;
&lt;td&gt;pgBackRest (up to 4 repos)&lt;/td&gt;
&lt;td&gt;Apache 2.0 · &lt;strong&gt;restricted prod images&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;OSS effectively commercialized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;③ &lt;strong&gt;Percona&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Patroni&lt;/td&gt;
&lt;td&gt;K8s API (Endpoints lease)&lt;/td&gt;
&lt;td&gt;database + pgBackRest + &lt;strong&gt;PMM&lt;/strong&gt; sidecar&lt;/td&gt;
&lt;td&gt;pgBackRest&lt;/td&gt;
&lt;td&gt;Apache 2.0 · &lt;strong&gt;fully open (no restrictions)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;stable company · active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;④ &lt;strong&gt;StackGres&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Patroni&lt;/td&gt;
&lt;td&gt;K8s API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;batteries&lt;/strong&gt; (Envoy·PgBouncer·exporter·fluent-bit·controller)&lt;/td&gt;
&lt;td&gt;SGBackup · continuous archiving&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AGPL 3.0&lt;/strong&gt; (copyleft)&lt;/td&gt;
&lt;td&gt;OnGres flagship · active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;⑤ &lt;strong&gt;CloudNativePG&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;own Instance Manager&lt;/strong&gt; (no Patroni)&lt;/td&gt;
&lt;td&gt;K8s API (Endpoint Leases)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Instance Manager + PostgreSQL&lt;/strong&gt; (two processes in one container)&lt;/td&gt;
&lt;td&gt;Barman Cloud&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CNCF · #1 today · most active&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Lined up, the differences are clear. &lt;strong&gt;①–④ all use Patroni&lt;/strong&gt; — the real difference is "what you stack on top of that Pod (backup, monitoring, proxy)" and "the license." And &lt;strong&gt;only ⑤ CNPG drops Patroni&lt;/strong&gt; and adopts its own Instance Manager. &lt;strong&gt;The consensus store is the Kubernetes API for all five&lt;/strong&gt; — the common myth that "using Patroni requires external etcd" is, on Kubernetes, no longer true.&lt;/p&gt;

&lt;p&gt;My homelab's criteria were clear. Splitting 19 vCPU, it had to be &lt;strong&gt;lightweight&lt;/strong&gt;; extending it to in-house or commercial use someday, it had to be &lt;strong&gt;free of license strings&lt;/strong&gt;; and running it long-term, it needed &lt;strong&gt;governance that won't fade&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Why I chose CloudNativePG
&lt;/h2&gt;

&lt;p&gt;After comparing all those Operators, the solution that best fit my current homelab (k3s-based) and my plans for it was &lt;strong&gt;CloudNativePG (CNPG)&lt;/strong&gt;. Three things clinched it.&lt;/p&gt;

&lt;h3&gt;
  
  
  ① Extreme simplicity and lightness (Kubernetes Native)
&lt;/h3&gt;

&lt;p&gt;By nature, a homelab's resources (CPU/memory) aren't as plentiful as an enterprise's. Running Patroni as a separate HA layer, as the other four Operators do, or an external DCS (etcd, ZooKeeper), as a classic Patroni setup outside Kubernetes does, adds the burden of managing that component's own state on top. CNPG removes these dependencies entirely and &lt;strong&gt;uses the Kubernetes API Server itself as the DCS.&lt;/strong&gt; Without interposing a separate HA process like Patroni, only PostgreSQL and the &lt;strong&gt;Instance Manager (a Go process)&lt;/strong&gt; that wraps and manages it as PID 1 are running, so overhead is minimal, the structure is intuitive, and debugging is far easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  ② Rock-solid backup/recovery with Barman (PITR)
&lt;/h3&gt;

&lt;p&gt;For a database, nothing matters more than backups. CNPG embeds the cloud edition of &lt;strong&gt;Barman (Backup and Recovery Manager)&lt;/strong&gt;, the de facto standard for PostgreSQL backups. With just a few lines of config, it continuously archives every transaction log (WAL) to S3-compatible storage (in my case, MinIO running inside the homelab, or Cloudflare R2). Even if a node or disk physically dies, you can restore the data to a past point in time (Point-In-Time Recovery), up to the last archived WAL. CNPG sets the default &lt;code&gt;archive_timeout&lt;/code&gt; to 5 minutes, which forces a WAL switch at least that often and so bounds the &lt;strong&gt;RPO (Recovery Point Objective) at roughly 5 minutes&lt;/strong&gt; as long as archiving keeps succeeding; you can shrink that window further with synchronous replication or a shorter archive interval.&lt;/p&gt;
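
&lt;p&gt;As a rough illustration of the "shorter archive interval" option, the fragment below overrides &lt;code&gt;archive_timeout&lt;/code&gt; through the &lt;code&gt;spec.postgresql.parameters&lt;/code&gt; map of a &lt;code&gt;Cluster&lt;/code&gt;. The one-minute value is an assumption picked for the example, not a setting from my cluster; a shorter interval means more WAL files to upload, so treat this as a sketch to adapt rather than a recommendation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  postgresql:
    parameters:
      # Assumed example value; CNPG's default is 5min.
      archive_timeout: "1min"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;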

&lt;h3&gt;
  
  
  ③ A declarative CRD — the ideal candidate for GitOps next time
&lt;/h3&gt;

&lt;p&gt;CNPG's &lt;code&gt;Cluster&lt;/code&gt; CRD is thoroughly declarative. The PostgreSQL version, instance count, resource limits, storage, backup policy — the cluster's entire state fits in a single YAML.&lt;/p&gt;

&lt;p&gt;This time I install the Operator with &lt;code&gt;helm&lt;/code&gt; and stand the cluster up directly with &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fact that the whole cluster state fits in YAML means that definition can go straight into Git and be run declaratively. And &lt;strong&gt;next time I plan to GitOps the whole homelab, including this cluster, with ArgoCD&lt;/strong&gt; — CNPG, which expresses everything as a single CRD, is the best-suited choice for that move.&lt;/p&gt;

&lt;p&gt;A YAML file in Git becomes the cluster state as-is; commit a change and ArgoCD syncs it, and the Operator applies it through a rolling update that keeps downtime to a minimum. Having decided to bring the database inside Kubernetes, the setup only stops feeling half-finished once its definition is managed as code, too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfuzo1gfd51skh6tbkzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfuzo1gfd51skh6tbkzb.png" alt="Homelab CloudNativePG architecture (k3s-based)" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Hands-on: installing the CNPG Operator
&lt;/h2&gt;

&lt;p&gt;With CNPG, you deploy the Operator once cluster-wide with Helm, and from then on you can create a &lt;code&gt;Cluster&lt;/code&gt; in any namespace. Add the official chart repo and install into the &lt;code&gt;cnpg-system&lt;/code&gt; namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; cnpg cnpg/cloudnative-pg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; cnpg-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Without &lt;code&gt;--version&lt;/code&gt;, Helm installs the latest chart release. To pin a specific version, add &lt;code&gt;--version &amp;lt;chart-version&amp;gt;&lt;/code&gt;; note that the chart version is numbered separately from the Operator version. (This article is based on Operator &lt;strong&gt;v1.24&lt;/strong&gt;.)&lt;/p&gt;
&lt;/blockquote&gt;

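&lt;p&gt;Once installed, check that the Operator Pod is &lt;code&gt;Running&lt;/code&gt;. The Pod name is prefixed with the Deployment name, which depends on the install method and the Helm release name; in the output below it is &lt;code&gt;cnpg-controller-manager&lt;/code&gt;.&lt;br&gt;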
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-system
&lt;span class="c"&gt;# NAME READY STATUS RESTARTS AGE&lt;/span&gt;
&lt;span class="c"&gt;# cnpg-controller-manager-8d447b4b6-xxxxx 1/1 Running 0 40s&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All CNPG operations are done through custom resources (CRDs). Check that the main CRDs are registered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd | &lt;span class="nb"&gt;grep &lt;/span&gt;postgresql.cnpg.io
&lt;span class="c"&gt;# backups.postgresql.cnpg.io&lt;/span&gt;
&lt;span class="c"&gt;# clusterimagecatalogs.postgresql.cnpg.io&lt;/span&gt;
&lt;span class="c"&gt;# clusters.postgresql.cnpg.io&lt;/span&gt;
&lt;span class="c"&gt;# imagecatalogs.postgresql.cnpg.io&lt;/span&gt;
&lt;span class="c"&gt;# poolers.postgresql.cnpg.io&lt;/span&gt;
&lt;span class="c"&gt;# scheduledbackups.postgresql.cnpg.io&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, install the &lt;strong&gt;&lt;code&gt;cnpg&lt;/code&gt; kubectl plugin&lt;/strong&gt; you'll use later to inspect cluster state and verify connections. With &lt;a href="https://krew.sigs.k8s.io/" rel="noopener noreferrer"&gt;krew&lt;/a&gt; it's one line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;cnpg
kubectl cnpg version
&lt;span class="c"&gt;# Build: {Version:1.24.1 ...}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Without krew, you can also install it via CNPG's &lt;a href="https://cloudnative-pg.io/documentation/current/kubectl-plugin/" rel="noopener noreferrer"&gt;official install script&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Operator and plugin are ready. Now it's time to stand up an actual database cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Hands-on: deploying the first highly available PostgreSQL cluster
&lt;/h2&gt;

&lt;p&gt;With the Operator ready, let's stand up a PostgreSQL cluster that actually holds data. CNPG packs the cluster's entire state — instance count, PostgreSQL version, resources, storage — into a single &lt;code&gt;Cluster&lt;/code&gt; CRD.&lt;/p&gt;

&lt;p&gt;Below is the manifest (&lt;code&gt;demo-db.yaml&lt;/code&gt;) for a 3-node HA cluster I made for verification. Since it's for testing, resources are kept small (adjust as needed), and storage uses k3s's default &lt;code&gt;local-path&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql.cnpg.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cluster&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-db&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cnpg-demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="c1"&gt;# 1 Primary + 2 Replica&lt;/span&gt;
  &lt;span class="na"&gt;imageName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/cloudnative-pg/postgresql:16.4&lt;/span&gt;

  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
    &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-path&lt;/span&gt; &lt;span class="c1"&gt;# k3s default (WaitForFirstConsumer)&lt;/span&gt;

  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the namespace and apply.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace cnpg-demo
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; demo-db.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Operator first bootstraps the Primary with &lt;code&gt;initdb&lt;/code&gt;, then joins the Replicas one by one. (When a homelab node pulls the PostgreSQL image for the first time, the initial startup can take a few minutes.) Once the status reads &lt;code&gt;Cluster in healthy state&lt;/code&gt;, it's done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get cluster demo-db &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo
&lt;span class="c"&gt;# NAME AGE INSTANCES READY STATUS PRIMARY&lt;/span&gt;
&lt;span class="c"&gt;# demo-db 24m 3 3 Cluster in healthy state demo-db-1&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Pods consist of one Primary and two Replicas, and thanks to CNPG's default anti-affinity, they spread across separate nodes as much as possible. In my homelab, the three Pods landed on three different Lima nodes (agent-2/3/4).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo
&lt;span class="c"&gt;# NAME READY STATUS RESTARTS AGE&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-1 1/1 Running 0 13m&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-2 1/1 Running 0 6m&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-3 1/1 Running 0 3m&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
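
&lt;p&gt;The spreading described above is controlled by the &lt;code&gt;spec.affinity&lt;/code&gt; section of the &lt;code&gt;Cluster&lt;/code&gt;. The following is a minimal sketch, assuming you want to turn the default best-effort behaviour into a hard rule; my demo manifest does not set any of this and simply relies on the defaults.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  affinity:
    enablePodAntiAffinity: true          # already on by default
    topologyKey: kubernetes.io/hostname  # spread instances by node
    podAntiAffinityType: required        # assumption for this sketch; the default is "preferred"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;required&lt;/code&gt;, an instance that cannot find a node of its own stays &lt;code&gt;Pending&lt;/code&gt; instead of sharing one, so this only makes sense when you have at least as many schedulable nodes as instances.&lt;/p&gt;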



&lt;p&gt;The Operator also creates three Services for connecting. Writes always go to the Primary via &lt;code&gt;*-rw&lt;/code&gt;, reads spread across the Replicas via &lt;code&gt;*-ro&lt;/code&gt;, and &lt;code&gt;*-r&lt;/code&gt; includes both the Primary and Replicas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo
&lt;span class="c"&gt;# NAME TYPE CLUSTER-IP PORT(S) AGE&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-r ClusterIP 10.43.89.62 5432/TCP 27m&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-ro ClusterIP 10.43.198.255 5432/TCP 27m&lt;/span&gt;
&lt;span class="c"&gt;# demo-db-rw ClusterIP 10.43.55.136 5432/TCP 27m&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Hands-on: verifying the connection and replication state
&lt;/h2&gt;

&lt;p&gt;With the cluster up, let's actually connect and confirm it works. On cluster creation, CNPG auto-generates an &lt;strong&gt;application Secret (&lt;code&gt;&amp;lt;cluster&amp;gt;-app&lt;/code&gt;)&lt;/strong&gt;. It holds the username, DB name, password, and a ready-to-use connection URI. (Admin/superuser access is &lt;strong&gt;disabled by default&lt;/strong&gt; and is enabled when needed via &lt;code&gt;spec.enableSuperuserAccess: true&lt;/code&gt;, so there's no &lt;code&gt;*-superuser&lt;/code&gt; Secret in the default setup.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret demo-db-app &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.username}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;span class="c"&gt;# app&lt;/span&gt;
kubectl get secret demo-db-app &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.dbname}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;span class="c"&gt;# app&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
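
&lt;p&gt;An application running in the same namespace can consume that Secret directly instead of copying the password around. The fragment below is only a sketch: the container name, image, and environment variable name are hypothetical, and I'm assuming the connection URI is stored under the key &lt;code&gt;uri&lt;/code&gt; (check with &lt;code&gt;kubectl describe secret demo-db-app -n cnpg-demo&lt;/code&gt; before relying on it).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of a hypothetical application Deployment (Pod template) in cnpg-demo
containers:
  - name: my-app                 # hypothetical container name
    image: example/my-app:1.0    # hypothetical image
    env:
      - name: DATABASE_URL       # hypothetical variable the app reads
        valueFrom:
          secretKeyRef:
            name: demo-db-app    # Secret generated by CNPG for this cluster
            key: uri             # assumed key holding the connection URI

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;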



&lt;p&gt;The easiest way to connect is the &lt;strong&gt;&lt;code&gt;cnpg&lt;/code&gt; plugin&lt;/strong&gt; from §4. Open a &lt;code&gt;psql&lt;/code&gt; session straight to the Primary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cnpg psql demo-db &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the session, check the replication state. The Primary should be streaming WAL to the two Replicas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sync_state&lt;/span&gt;
           &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="n"&gt;application_name&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sync_state&lt;/span&gt;
&lt;span class="c1"&gt;------------------+-------------+-----------+------------&lt;/span&gt;
 &lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;streaming&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;async&lt;/span&gt;
 &lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;236&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;streaming&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;async&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both Replicas are &lt;code&gt;streaming&lt;/code&gt;. With no Patroni and no external DCS, using only the Kubernetes API, the Operator, and each Pod's Instance Manager, a primary was chosen and streaming replication was set up. Replication is asynchronous (&lt;code&gt;async&lt;/code&gt;) by default, so commits on the Primary don't wait for the Replicas and write latency is unaffected; the trade-off is that the most recent transactions can be lost on failover. If you need stronger guarantees, you can enable synchronous replication (&lt;code&gt;minSyncReplicas&lt;/code&gt;/&lt;code&gt;maxSyncReplicas&lt;/code&gt;) in the &lt;code&gt;spec&lt;/code&gt;.&lt;/p&gt;
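
&lt;p&gt;For reference, here is roughly what turning on synchronous replication could look like with the &lt;code&gt;minSyncReplicas&lt;/code&gt;/&lt;code&gt;maxSyncReplicas&lt;/code&gt; fields mentioned above. The values are assumptions chosen for a 3-instance cluster, not something I applied in this article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  instances: 3
  minSyncReplicas: 1   # assumed value: at least one Replica must confirm each commit
  maxSyncReplicas: 1   # assumed value: at most one Replica is treated as synchronous

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After applying a change like this, re-running the &lt;code&gt;pg_stat_replication&lt;/code&gt; query should show a different &lt;code&gt;sync_state&lt;/code&gt; for at least one Replica. The price is that every commit now waits for a Replica to acknowledge it, which adds write latency.&lt;/p&gt;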

&lt;p&gt;With that, a lightweight, simple, highly available PostgreSQL cluster is running on the homelab's k3s. One last thing remains — how to keep this data safe: adding backups.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. The last piece that protects your data — backups to MinIO
&lt;/h2&gt;

&lt;p&gt;Even with the cluster up, without backups you're only halfway there. When a whole node dies, or you drop a table by accident, only a &lt;strong&gt;base backup + WAL archive&lt;/strong&gt; lets you roll back to a point in time (PITR).&lt;/p&gt;

&lt;h3&gt;
  
  
  S3-compatible API — any backend works
&lt;/h3&gt;

&lt;p&gt;CNPG's backup engine, &lt;strong&gt;Barman Cloud&lt;/strong&gt;, uploads backups to object storage via the &lt;strong&gt;S3-compatible API&lt;/strong&gt;. The key point here is that "S3-compatible" is a de facto standard interface. So the backend can be &lt;strong&gt;AWS S3, Cloudflare R2, or a self-hosted MinIO&lt;/strong&gt;; you just change the endpoint and credentials in the config and it works as-is. (Barman Cloud can also target &lt;strong&gt;Google Cloud Storage and Azure Blob&lt;/strong&gt;, but those go through their own credential settings rather than the S3 ones.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrihkggif5sfv8di5sxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrihkggif5sfv8di5sxo.png" alt="CNPG backup — S3-compatible object storage (MinIO)" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I chose this time is &lt;strong&gt;MinIO&lt;/strong&gt;. MinIO is open-source object storage that implements the S3 API directly, and you can stand it up right inside the cluster. I had three reasons for picking MinIO over a managed S3 service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty&lt;/strong&gt; : backups never leave the homelab — not one step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero cost&lt;/strong&gt; : no cloud storage or transfer charges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple access&lt;/strong&gt; : it connects directly to &lt;code&gt;minio.minio:9000&lt;/code&gt; within the same k3s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I ever need off-site backups, I just change the endpoint to S3 or R2 in the same config and swap the credentials. That near-zero cost of swapping backends is the strength of S3-compatibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding the backup config
&lt;/h3&gt;

&lt;p&gt;First, place the MinIO connection credentials as a Secret &lt;strong&gt;in the same namespace as the cluster&lt;/strong&gt; (the key names are arbitrary; here, &lt;code&gt;ACCESS_KEY&lt;/code&gt;/&lt;code&gt;SECRET_KEY&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo create secret generic minio-backup-creds &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;minio-access-key&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;SECRET_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;minio-secret-key&amp;gt;'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add &lt;code&gt;spec.backup.barmanObjectStore&lt;/code&gt; to the &lt;code&gt;Cluster&lt;/code&gt;. Point it at the bucket path (&lt;code&gt;destinationPath&lt;/code&gt;), the MinIO endpoint (&lt;code&gt;endpointURL&lt;/code&gt;), and the Secret you just made.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;barmanObjectStore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;destinationPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://cnpg-demo/&lt;/span&gt; &lt;span class="c1"&gt;# MinIO bucket&lt;/span&gt;
      &lt;span class="na"&gt;endpointURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://minio.minio:9000&lt;/span&gt; &lt;span class="c1"&gt;# MinIO inside the cluster&lt;/span&gt;
      &lt;span class="na"&gt;s3Credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessKeyId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio-backup-creds&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ACCESS_KEY&lt;/span&gt;
        &lt;span class="na"&gt;secretAccessKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio-backup-creds&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SECRET_KEY&lt;/span&gt;
      &lt;span class="na"&gt;wal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gzip&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once applied, the Operator takes over PostgreSQL's &lt;code&gt;archive_command&lt;/code&gt; and begins &lt;strong&gt;continuous WAL archiving&lt;/strong&gt;. When the cluster's &lt;code&gt;ContinuousArchiving&lt;/code&gt; condition turns true, it's ready.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get cluster demo-db &lt;span class="nt"&gt;-n&lt;/span&gt; cnpg-demo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.conditions[?(@.type=="ContinuousArchiving")].status}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;span class="c"&gt;# True&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  First backup and verification
&lt;/h3&gt;

&lt;p&gt;WAL keeps flowing, but you need to take a &lt;strong&gt;base backup&lt;/strong&gt; once as the reference point for recovery. A single &lt;code&gt;Backup&lt;/code&gt; resource does it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql.cnpg.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Backup&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-db-backup-1&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cnpg-demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;barmanObjectStore&lt;/span&gt;
  &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-db&lt;/span&gt;


&lt;span class="s"&gt;kubectl apply -f backup.yaml&lt;/span&gt;
&lt;span class="s"&gt;kubectl get backup -n cnpg-demo&lt;/span&gt;
&lt;span class="c1"&gt;# NAME CLUSTER METHOD PHASE ERROR&lt;/span&gt;
&lt;span class="c1"&gt;# demo-db-backup-1 demo-db barmanObjectStore completed&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The phase is &lt;code&gt;completed&lt;/code&gt;. Inside the MinIO bucket, the base backup sits under a timestamped directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# inside MinIO bucket cnpg-demo
demo-db/
└── base/
    └── 20260609T094230/ # base backup (WAL accumulates under wals/ from here)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, even if a node dies, this cluster can be restored to any point in time after the base backup, using the base backup and the WAL accumulated in MinIO. Having brought the data inside Kubernetes, we've now also prepared the safety net that protects it, all in the same declarative way.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Wrapping up — and what's next
&lt;/h2&gt;

&lt;p&gt;I'd steer clear of spanning a CNPG cluster across nodes that sit on opposite sides of Tailscale.&lt;/p&gt;

&lt;p&gt;CNPG is sensitive to inter-node network quality. The Primary constantly streams WAL to the Replicas, and each Pod's Instance Manager updates its state (lease) to the Kubernetes API on a short cycle. But this homelab's inter-node communication is a doubly encapsulated structure — flannel VXLAN layered again on top of a Tailscale (WireGuard) mesh — and the cloud (Tokyo)–home (Sapporo) leg is effectively a WAN. Placing the Primary and Replicas across that leg piles increased RTT and jitter onto a shrunken MTU, and WAL streaming breaks while leases expire. In fact, when I built a cluster spanning the Lightsail and Lima nodes, latency-driven connection drops and Pod restarts repeated endlessly.&lt;/p&gt;

&lt;p&gt;So I strongly recommend keeping a CNPG cluster within the same low-latency leg, not crossing Tailscale. Pin the cluster to a single site (cloud nodes together, or home nodes together) with &lt;code&gt;nodeSelector&lt;/code&gt; or &lt;code&gt;affinity&lt;/code&gt;, and inter-node communication stays LAN-stable. If you need redundancy across sites, a per-site cluster plus an asynchronous Replica Cluster is safer than spreading a single cluster across the WAN.&lt;/p&gt;

&lt;p&gt;Looking back, CNPG's appeal came down to &lt;strong&gt;simplicity&lt;/strong&gt;. Decide the leader with the Kubernetes API alone — no Patroni, no external DCS — and hand backups off to S3-compatible storage. Light, free (Apache 2.0, CNCF), and above all, everything from the cluster to the backup policy declared in a single YAML — that fit my 19-vCPU homelab, and what comes after it, perfectly.&lt;/p&gt;

&lt;p&gt;And that "single YAML" is exactly the starting point for next time. This time I stood it up &lt;strong&gt;imperatively&lt;/strong&gt; with &lt;code&gt;helm&lt;/code&gt; and &lt;code&gt;kubectl apply&lt;/code&gt;, but the cluster and its backups are, in the end, declarative manifests. Next time I'll &lt;strong&gt;GitOps the whole homelab, including this cluster, with ArgoCD&lt;/strong&gt; — moving toward the picture where a YAML on Git is the cluster state itself, and a single commit becomes a deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudNativePG&lt;/strong&gt; — &lt;a href="https://cloudnative-pg.io/documentation/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; · &lt;a href="https://github.com/cloudnative-pg/cloudnative-pg" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zalando Postgres Operator&lt;/strong&gt; — &lt;a href="https://github.com/zalando/postgres-operator" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://patroni.readthedocs.io/" rel="noopener noreferrer"&gt;Patroni docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrunchyData PGO&lt;/strong&gt; — &lt;a href="https://github.com/CrunchyData/postgres-operator" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Percona Operator for PostgreSQL&lt;/strong&gt; — &lt;a href="https://github.com/percona/percona-postgresql-operator" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StackGres (OnGres)&lt;/strong&gt; — &lt;a href="https://stackgres.io/" rel="noopener noreferrer"&gt;official site&lt;/a&gt; · &lt;a href="https://gitlab.com/ongresinc/stackgres" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator pattern (CoreOS, 2016)&lt;/strong&gt; — &lt;a href="https://www.redhat.com/en/blog/introducing-operators-putting-operational-knowledge-into-software" rel="noopener noreferrer"&gt;Introducing Operators&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparing PostgreSQL Operators&lt;/strong&gt; — &lt;a href="https://palark.com/blog/comparing-kubernetes-operators-for-postgresql/" rel="noopener noreferrer"&gt;Palark&lt;/a&gt; · &lt;a href="https://simplyblock.io/blog/choosing-a-kubernetes-postgres-operator/" rel="noopener noreferrer"&gt;simplyblock&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Hybrid k3s #3: Pods couldn't talk to each other — flannel VXLAN and vmnet</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Sat, 06 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/seon/hybrid-k3s-3-pods-couldnt-talk-to-each-other-flannel-vxlan-and-vmnet-1d42</link>
      <guid>https://dev.to/seon/hybrid-k3s-3-pods-couldnt-talk-to-each-other-flannel-vxlan-and-vmnet-1d42</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-3%2Fk3s-3-master-en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-3%2Fk3s-3-master-en.png" alt="Hybrid k3s — full architecture" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  0. About this series
&lt;/h2&gt;

&lt;p&gt;This series is a record — written one piece at a time — of how I actually built the homelab in the diagram above, the one that's still running as I write this.&lt;/p&gt;

&lt;p&gt;What began as a toy project born of a simple "could this even work?" turned, through surprisingly good performance and endless tearing down and rebuilding, into a genuine plaything that takes the edge off the stress built up at work. It isn't a resource-rich cluster, but it's been more than enough to get a real taste of Kubernetes, and it keeps handing me the next thing I want to try.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 nodes&lt;/strong&gt; — 2 Lightsail &lt;strong&gt;servers&lt;/strong&gt; (control plane + etcd) in the cloud (AWS Tokyo) + 4 &lt;strong&gt;Lima VM agents&lt;/strong&gt; on a home (Sapporo) iMac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 vCPU / 61 GiB&lt;/strong&gt; total, &lt;strong&gt;49 namespaces&lt;/strong&gt;, &lt;strong&gt;248 pods (150 running)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed with &lt;strong&gt;ArgoCD&lt;/strong&gt;, auth via &lt;strong&gt;Keycloak OIDC&lt;/strong&gt;, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana and more running on top&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;In #1 I stood up two cloud control-plane nodes, and in #2 I welcomed the home iMac as four Lima VM agents — and &lt;strong&gt;all six nodes went &lt;code&gt;Ready&lt;/code&gt;.&lt;/strong&gt; This article is about what came next: the nodes were all connected, yet &lt;strong&gt;the Pods themselves couldn't talk across nodes.&lt;/strong&gt; I peer into flannel, follow where the packets actually go, and end up binding the VMs on the same iMac directly with &lt;strong&gt;vmnet.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Six nodes &lt;code&gt;Ready&lt;/code&gt;, yet the Pods were strangers
&lt;/h2&gt;

&lt;p&gt;The picture I left off on last time looked clean. &lt;code&gt;kubectl get nodes&lt;/code&gt; showed the two Tokyo servers and four Sapporo Lima agents all &lt;code&gt;Ready&lt;/code&gt;, each with a Tailscale &lt;code&gt;100.x&lt;/code&gt; address as its INTERNAL-IP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;NAME STATUS ROLES AGE VERSION INTERNAL-IP OS-IMAGE CONTAINER-RUNTIME
ip-172-26-2-70… Ready control-plane,etcd 143d v1.34.3+k3s1 100.99.x.x Amazon Linux 2023 containerd://2.1.5-k3s1
ip-172-26-3-146… Ready control-plane,etcd 143d v1.34.3+k3s1 100.71.x.x Amazon Linux 2023 containerd://2.1.5-k3s1
&lt;/span&gt;&lt;span class="gp"&gt;lima-k3s-agent Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;143d v1.34.3+k3s1 100.84.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-2 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;143d v1.34.3+k3s1 100.98.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-3 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;143d v1.34.3+k3s1 100.117.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-4 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;143d v1.34.3+k3s1 100.90.x.x Ubuntu 25.10 containerd://2.1.5-k3s1
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I thought I was done. So I eagerly started piling Pods on — and almost immediately hit a strange wall. &lt;strong&gt;Pods on the same node talked just fine, but Pods on different nodes couldn't reach each other.&lt;/strong&gt; Services timed out in odd places, and some Pods couldn't even resolve DNS.&lt;/p&gt;

&lt;p&gt;At first I thought, "every node is &lt;code&gt;Ready&lt;/code&gt; — so why?" That was a misread. &lt;code&gt;Ready&lt;/code&gt; and "the Pod network works" are &lt;strong&gt;at two different layers.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node Ready&lt;/strong&gt; — the &lt;strong&gt;control-plane&lt;/strong&gt; path, where a node's kubelet trades heartbeats with the apiserver. That's exactly what I'd set up so far: making the node and the apiserver reach each other over Tailscale &lt;code&gt;100.x&lt;/code&gt;. As long as this path is alive, a node looks &lt;code&gt;Ready&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod ↔ Pod (across node boundaries)&lt;/strong&gt; — the &lt;strong&gt;data-plane&lt;/strong&gt; path, where Pods on different nodes exchange packets directly. This is a &lt;strong&gt;completely separate road&lt;/strong&gt; from the control plane, and the thing that lays it isn't the kubelet but the &lt;strong&gt;CNI&lt;/strong&gt; (here, flannel).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1urlqwfodbmmzq4vr5il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1urlqwfodbmmzq4vr5il.png" alt="Node Ready and Pod networking are different planes — control plane vs data plane" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All six being &lt;code&gt;Ready&lt;/code&gt; with a &lt;code&gt;100.x&lt;/code&gt; (Tailscale) INTERNAL-IP only means &lt;strong&gt;the control plane is sound.&lt;/strong&gt; Last time's work got only as far as "the nodes are recognized as members of one cluster"; &lt;strong&gt;the road that carries Pod traffic across node boundaries had not been verified yet.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even with &lt;code&gt;kubectl get nodes&lt;/code&gt; all &lt;code&gt;Ready&lt;/code&gt;, inter-node Pod traffic isn't guaranteed. The control plane (node ↔ apiserver) and the data plane (Pod ↔ Pod) are separate paths, and the latter is the CNI's job. So the next question narrows to one thing — &lt;strong&gt;exactly how does flannel carry a Pod packet across node boundaries?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. flannel and VXLAN — how a Pod packet crosses a node boundary
&lt;/h2&gt;

&lt;p&gt;k3s uses &lt;strong&gt;flannel&lt;/strong&gt; as its CNI and &lt;strong&gt;VXLAN&lt;/strong&gt; as flannel's default backend. In #1 I brought the servers up with &lt;code&gt;--flannel-backend vxlan&lt;/code&gt;, and the #2 agents inherited that setting as-is. (&lt;a href="https://docs.k3s.io/networking/basic-network-options" rel="noopener noreferrer"&gt;k3s Basic Network Options&lt;/a&gt; — flannel's default backend is &lt;code&gt;vxlan&lt;/code&gt;; &lt;code&gt;host-gw&lt;/code&gt;, &lt;code&gt;wireguard-native&lt;/code&gt;, and &lt;code&gt;none&lt;/code&gt; are the alternatives.)&lt;/p&gt;

&lt;p&gt;Let me trace a Pod packet's journey in two cases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Within the same node&lt;/strong&gt;: every Pod hangs off the node's &lt;code&gt;cni0&lt;/code&gt; bridge. Two Pods on the same node meet directly at L2 on that bridge. They never cross a node boundary, so the traffic is fast and doesn't depend on the inter-node network at all. (That's why "Pods on the same node talked" in §1.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To another node&lt;/strong&gt;: when the destination Pod is on a different node, the packet leaves &lt;code&gt;cni0&lt;/code&gt; and enters a virtual interface called &lt;code&gt;flannel.1&lt;/code&gt;. This is VXLAN's &lt;strong&gt;VTEP&lt;/strong&gt; (VXLAN Tunnel Endpoint). Here the original Pod packet (Ethernet frame and all) is &lt;strong&gt;encapsulated whole inside a UDP packet&lt;/strong&gt; and sent to the peer node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "address" and "port" that receive this capsule are the crux.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The port is UDP 8472.&lt;/strong&gt; On Linux, flannel's VXLAN backend uses the kernel default port &lt;strong&gt;8472/udp&lt;/strong&gt; (only on Windows does it use the IANA standard 4789). So &lt;strong&gt;nodes must be able to reach each other on 8472/udp.&lt;/strong&gt; (&lt;a href="https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md" rel="noopener noreferrer"&gt;flannel backends&lt;/a&gt; — "On Linux, defaults to kernel default, currently 8472" · &lt;a href="https://docs.k3s.io/installation/requirements" rel="noopener noreferrer"&gt;k3s network requirements&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The address is each node's advertised &lt;code&gt;public-ip&lt;/code&gt;.&lt;/strong&gt; flannel advertises, per node, a destination (the VTEP's outer IP) that says "this is where I receive VXLAN capsules." The default is &lt;strong&gt;the IP of that node's default-route interface&lt;/strong&gt; (the per-node real values are in §5). This "which address gets advertised" is what trips us up all the way through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu28ughwayv1ny0bou7po.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu28ughwayv1ny0bou7po.png" alt="flannel VXLAN dataflow — from cni0 to flannel.1, encapsulated and sent to the peer on 8472" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Encapsulation isn't free. VXLAN wrapping (outer IP and UDP headers, the VXLAN header, and the inner Ethernet header) adds &lt;strong&gt;50 bytes&lt;/strong&gt; to every packet. So flannel sets &lt;code&gt;flannel.1&lt;/code&gt;'s MTU to &lt;strong&gt;the host interface's MTU minus 50.&lt;/strong&gt; If the host is 1500, &lt;code&gt;flannel.1&lt;/code&gt; becomes 1450.&lt;/p&gt;
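
&lt;p&gt;As a rough check on where the 50 bytes come from, here is a minimal Python sketch. It assumes an IPv4 underlay with no IP options and an untagged inner Ethernet frame; the names &lt;code&gt;VXLAN_OVERHEAD&lt;/code&gt; and &lt;code&gt;vxlan_mtu&lt;/code&gt; are made up for illustration and are not part of flannel.&lt;/p&gt;

```python
# Per-packet bytes added by VXLAN encapsulation.
# Assumption: IPv4 underlay without IP options, untagged inner Ethernet frame.
VXLAN_OVERHEAD = {
    "outer_ipv4_header": 20,
    "outer_udp_header": 8,
    "vxlan_header": 8,
    "inner_ethernet_header": 14,
}


def vxlan_mtu(host_mtu):
    """Largest inner packet that still fits in one outer packet of host_mtu bytes."""
    return host_mtu - sum(VXLAN_OVERHEAD.values())


print(sum(VXLAN_OVERHEAD.values()))  # 50
print(vxlan_mtu(1500))               # 1450
```

&lt;p&gt;On an IPv6 underlay the outer IP header is 40 bytes instead of 20, so the same sum would come to 70.&lt;/p&gt;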

&lt;p&gt;On an agent node (&lt;code&gt;lima-k3s-agent&lt;/code&gt;), it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /run/flannel/subnet.env
&lt;span class="go"&gt;FLANNEL_NETWORK=10.42.0.0/16
FLANNEL_SUBNET=10.42.2.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ip &lt;span class="nt"&gt;-br&lt;/span&gt; &lt;span class="nb"&gt;link&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'eth0|lima0|tailscale0|flannel'&lt;/span&gt;
&lt;span class="go"&gt;eth0 UP ... mtu 1500
lima0 UP ... mtu 1500
&lt;/span&gt;&lt;span class="gp"&gt;flannel.1 UNKNOWN ... mtu 1450 #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;1500 - 50 &lt;span class="o"&gt;(&lt;/span&gt;VXLAN&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;tailscale0 UNKNOWN ... mtu 1280 #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;this 1280 bites &lt;span class="k"&gt;in&lt;/span&gt; §5
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding &lt;code&gt;-d&lt;/code&gt; (details) shows in one line that &lt;code&gt;flannel.1&lt;/code&gt; is a VXLAN device sending to port 8472:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ip &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nb"&gt;link &lt;/span&gt;show flannel.1
&lt;span class="gp"&gt;5: flannel.1: &amp;lt;...&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450 qdisc noqueue state UNKNOWN ...
&lt;span class="go"&gt;    vxlan id 1 local 192.168.105.2 dev lima0 srcport 0 0 dstport 8472 nolearning ttl auto ...

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vxlan id 1&lt;/code&gt; with &lt;code&gt;dstport 8472&lt;/code&gt;: "VXLAN, sent to 8472" is right there in that line. The &lt;code&gt;local 192.168.105.2 dev lima0&lt;/code&gt; part in the middle, i.e. "which address/interface this node sends VXLAN out of," is the real key this time, but why it has that value is something §5 untangles.&lt;/p&gt;

&lt;h3&gt;
  
  
  MTU — the maximum size of a single packet
&lt;/h3&gt;

&lt;p&gt;The output above showed &lt;code&gt;mtu 1500&lt;/code&gt;, &lt;code&gt;mtu 1450&lt;/code&gt;, and &lt;code&gt;mtu 1280&lt;/code&gt;. Since this number bites hard in §5, let me pin it down here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MTU (Maximum Transmission Unit)&lt;/strong&gt; is the maximum size (in bytes) of a single packet an interface can carry at once. Ethernet's standard default is 1500, so an ordinary NIC starts at 1500. A packet larger than the MTU is split into fragments or, if it can't be split (for example, when the Don't Fragment flag is set), simply dropped. Fragmenting is slow, and dropping makes traffic look like it's stalled.&lt;/p&gt;

&lt;p&gt;The crux is that &lt;strong&gt;the more you wrap, the less real size fits inside.&lt;/strong&gt; Wrap a box in a bigger box and the outer size (1500) stays the same, but what fits inside shrinks by the padding. That's why each interface has a different MTU.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;MTU&lt;/th&gt;
&lt;th&gt;Why this value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;eth0&lt;/code&gt; / &lt;code&gt;lima0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;td&gt;bare Ethernet default, no encapsulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flannel.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;td&gt;1500 − 50. The largest packet that still fits within 1500 after one VXLAN wrap (50 B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tailscale0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1280&lt;/td&gt;
&lt;td&gt;WireGuard's encryption overhead + a conservative value safe on any link (IPv6's minimum MTU)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8swau9jd8xashp0kbbir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8swau9jd8xashp0kbbir.png" alt="Each wrap shrinks the MTU — 1500 → 1450 (VXLAN) → 1230 (VXLAN over Tailscale)" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If flannel VXLAN runs over physical Ethernet (1500), 1450 fits nicely. The problem is when inter-node traffic rides &lt;strong&gt;over Tailscale&lt;/strong&gt;: a Pod packet gets wrapped once by VXLAN and again by WireGuard on top (&lt;strong&gt;doubly wrapped&lt;/strong&gt;), and the size that fits inside shrinks further to about &lt;code&gt;1280 − 50 = 1230&lt;/code&gt;. flannel still sends as if it had 1450, and when the tunnel can't accept that much, the gap erupts as fragmentation, drops, and retransmits. Why this "double encapsulation" wrecks even latency is covered in §5.&lt;/p&gt;
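
&lt;p&gt;The mismatch can be put into numbers with a small Python sketch. It takes the 50-byte VXLAN overhead and the interface MTUs from the output above as given; the helper name &lt;code&gt;inner_mtu&lt;/code&gt; is hypothetical, and the last line assumes an IPv4 ICMP echo (20-byte IP header plus 8-byte ICMP header) as the probe.&lt;/p&gt;

```python
VXLAN_OVERHEAD = 50    # bytes per packet, figure taken from the text
IP_ICMP_HEADERS = 28   # assumption: 20-byte IPv4 header + 8-byte ICMP echo header


def inner_mtu(underlay_mtu):
    """Largest Pod packet that fits once VXLAN wraps it for the given underlay."""
    return underlay_mtu - VXLAN_OVERHEAD


configured = inner_mtu(1500)  # what flannel.1 is set to, derived from lima0
usable = inner_mtu(1280)      # what fits if the outer packet rides tailscale0
gap = configured - usable     # bytes flannel may send that the tunnel can't take

print(configured, usable, gap)   # 1450 1230 220
print(usable - IP_ICMP_HEADERS)  # 1202
```

&lt;p&gt;The final value, 1202, is the largest ICMP echo payload that would still fit a 1230-byte Pod packet, which makes it a handy size to probe with when testing Pod-to-Pod paths.&lt;/p&gt;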

&lt;blockquote&gt;
&lt;p&gt;Inter-node Pod traffic ultimately reduces to "can the nodes exchange UDP 8472 to each other's &lt;code&gt;public-ip&lt;/code&gt;?" If that's &lt;strong&gt;blocked&lt;/strong&gt;, the Pod network breaks (§3); if the path is &lt;strong&gt;slow&lt;/strong&gt;, the Pod network is slow (§5). It's enough to remember that &lt;code&gt;tailscale0&lt;/code&gt;'s MTU is &lt;strong&gt;1280&lt;/strong&gt;, and that putting VXLAN on top of it shrinks the ceiling further.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Inter-node traffic failed, so I opened 8472
&lt;/h2&gt;

&lt;p&gt;If §2's conclusion holds, cross-node Pod traffic comes down to "can the nodes exchange 8472 (VXLAN) to each other's &lt;code&gt;public-ip&lt;/code&gt;?" So when traffic failed, there was one most-likely suspect — &lt;strong&gt;8472 is blocked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's because in #1 I'd closed the firewall by the book. Cluster ports like the apiserver (6443), kubelet (10250), and &lt;strong&gt;flannel VXLAN (8472)&lt;/strong&gt; aren't opened to the public net; they're reachable only inside the private network / tailnet — the right default for minimizing exposure (#1's firewall table). So "inter-node VXLAN is blocked on the public net, which is why inter-node Pod traffic fails" was a natural hypothesis.&lt;/p&gt;

&lt;p&gt;Inside the node, flannel was indeed listening on 8472:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-ulnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;8472
&lt;span class="go"&gt;UNCONN 0 0 0.0.0.0:8472 0.0.0.0:*

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The port is open inside the node, yet it doesn't reach Pods on another node — so what's blocked isn't inside the node but the road &lt;em&gt;between&lt;/em&gt; nodes, I figured. On a "just make it work" impulse, I &lt;strong&gt;broke the principle and opened 8472/udp inbound on the Lightsail firewall.&lt;/strong&gt; Inter-node Pod traffic went through, and I ran it that way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo42sa3nieynlj60hzrap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo42sa3nieynlj60hzrap.png" alt="Close 8472 and it blocks; open it and it flows — and it's exposed" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note — not a recommended setup.&lt;/strong&gt; Unlike 6443, 8472 isn't a port guarded by certificates and tokens; opening it to the public net itself widens the exposed surface. "It went through, so this is the answer" was what I thought at the time, but &lt;strong&gt;"opening the port made it work" doesn't mean "the port is why it worked"&lt;/strong&gt; — this judgment gets overturned in §8 when I look at the actual route.&lt;/p&gt;

&lt;p&gt;When inter-node Pod traffic fails, suspecting &lt;strong&gt;inter-node 8472/udp (VXLAN) reachability&lt;/strong&gt; first is a reasonable starting point. But here I opened it to the public net, and that was a debt. The bill arrives in the very next section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. I added Longhorn and the restarts wouldn't stop
&lt;/h2&gt;

&lt;p&gt;Now that Pods could talk, I wanted to run a stateful workload "like a real cluster." I picked &lt;strong&gt;Longhorn&lt;/strong&gt; — distributed block storage for Kubernetes.&lt;/p&gt;

&lt;p&gt;The reason is simple. k3s's default storage (&lt;code&gt;local-path&lt;/code&gt;) is a hostPath tied to one node's disk, so it has &lt;strong&gt;no replication&lt;/strong&gt; — when a Pod moves to another node, the data doesn't follow. To try "data survives even if a node dies," you need distributed storage that replicates data across multiple nodes. Longhorn is the CNCF project that fills that role: it stands up a &lt;strong&gt;dedicated controller (the Longhorn Engine) per volume&lt;/strong&gt;, treats each volume like a microservice, and &lt;strong&gt;keeps synchronous replicas of it on multiple nodes' local disks.&lt;/strong&gt; (&lt;a href="https://longhorn.io/docs/1.11.2/what-is-longhorn/" rel="noopener noreferrer"&gt;Longhorn — What is Longhorn?&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Install was one Helm line, and at first it looked fine. Pods came up &lt;code&gt;Running&lt;/code&gt; and volumes were created.&lt;/p&gt;

&lt;p&gt;But almost immediately, &lt;strong&gt;errors and restarts began repeating endlessly.&lt;/strong&gt; Volumes dropped to &lt;code&gt;Degraded&lt;/code&gt;, replica syncs timed out, rebuilds spun up to fix them, and that load made them fail again: a self-reinforcing vicious cycle. Even though I hadn't rebuilt the cluster, the nodes became unstable, and other Pods running on them got dragged in.&lt;/p&gt;

&lt;p&gt;The cause was &lt;strong&gt;inter-node latency.&lt;/strong&gt; Longhorn's synchronous replication waits, for every single write, until the replicas respond "written." That's why Longhorn's docs state plainly that &lt;strong&gt;"latency is far more important to a volume's stability than throughput or IOPS"&lt;/strong&gt; (&lt;a href="https://longhorn.io/docs/1.11.2/best-practices/" rel="noopener noreferrer"&gt;Longhorn Best Practices&lt;/a&gt;), and the troubleshooting guide goes further, recommending &lt;strong&gt;inter-node latency under 20ms when multiple volumes do I/O on one node at the same time&lt;/strong&gt; (&lt;a href="https://longhorn.io/kb/troubleshooting-volume-readonly-or-io-error/" rel="noopener noreferrer"&gt;Longhorn KB — volume readonly / I/O error&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But latency between the home (lima) nodes was far above that. When I measured it, the &lt;strong&gt;replica's synchronous-write latency averaged 200ms+&lt;/strong&gt;, at least ten times the recommended 20ms. Since synchronous replication pays that whole latency on every write, volumes couldn't climb out of &lt;code&gt;Degraded&lt;/code&gt;, and rebuilds and restarts looped forever. In the end Longhorn effectively dropped the lima (home) nodes from storage, and the goal of "run stateful workloads on a real multi-node cluster" looked impossible on top of this latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtgcss9wsy9zzwlx5bm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxtgcss9wsy9zzwlx5bm.png" alt="Sync replication waits for the slowest node — Longhorn collapsed above 200ms" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;
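&lt;p&gt;To compare a link against that 20ms figure before installing anything, a small check like the one below could be used. It is only a sketch: the two function names are mine, &lt;code&gt;PEER_ADDR&lt;/code&gt; is a placeholder, and it assumes the summary line format that Linux &lt;code&gt;ping -q&lt;/code&gt; prints. It measures plain ICMP round-trips, not Longhorn's own replica write latency.&lt;/p&gt;

```shell
# Print the average RTT (ms) from the summary line that ping -q prints, e.g.
#   rtt min/avg/max/mdev = 0.449/0.571/0.639/0.064 ms
avg_rtt_ms() {
  awk -F' = ' '/min\/avg\/max/ { split($2, v, "/"); print v[2] }'
}

# Exit 0 when the average stays within the limit, 1 otherwise.
# $1 = average RTT in ms, $2 = limit in ms (defaults to the 20 ms figure above)
within_latency_budget() {
  awk -v avg="$1" -v limit="${2:-20}" \
    'BEGIN { if (avg + 0 > limit + 0) exit 1; exit 0 }'
}

# Usage (PEER_ADDR is a placeholder for the other node's address):
#   avg=$(ping -c5 -q "$PEER_ADDR" | avg_rtt_ms)
#   within_latency_budget "$avg" || echo "too slow for sync replication: ${avg} ms"
```

&lt;p&gt;Run from each storage node toward every other one, this gives a rough go/no-go signal before synchronous replication is put on the link.&lt;/p&gt;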

&lt;p&gt;There was one more thing that didn't add up. Those four too-slow home nodes are &lt;strong&gt;physically inside the same single iMac.&lt;/strong&gt; They're VMs in one machine — so where does 200ms come from? The next section follows that route.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Same host, so why slow? — to reach the next node, packets went to Tokyo and back
&lt;/h2&gt;

&lt;p&gt;§4's puzzle was this: four VMs inside one iMac, yet 200ms of inter-node latency. Let me follow where the packets actually go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clue 1 — there's no direct path between VMs on the same host.&lt;/strong&gt; Printing each VM's address shows something odd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;limactl shell k3s-agent &lt;span class="nt"&gt;--&lt;/span&gt; ip &lt;span class="nt"&gt;-4&lt;/span&gt; addr show eth0 | &lt;span class="nb"&gt;grep &lt;/span&gt;inet
&lt;span class="go"&gt;    inet 192.168.5.15/24 ... eth0
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;limactl shell k3s-agent-2 &lt;span class="nt"&gt;--&lt;/span&gt; ip &lt;span class="nt"&gt;-4&lt;/span&gt; addr show eth0 | &lt;span class="nb"&gt;grep &lt;/span&gt;inet
&lt;span class="go"&gt;    inet 192.168.5.15/24 ... eth0

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both VMs' &lt;code&gt;eth0&lt;/code&gt; are &lt;strong&gt;&lt;code&gt;192.168.5.15&lt;/code&gt;.&lt;/strong&gt; Lima's default user-mode network fixes the subnet at &lt;code&gt;192.168.5.0/24&lt;/code&gt;, and each VM gets the same address behind its own independent NAT. So they're &lt;strong&gt;unreachable directly from the host and from other guests (VMs)&lt;/strong&gt; — they don't even know each other, despite being inside the same iMac. Lima's own docs note this limitation and point you to "use VMNet to access from the host or other guests" (&lt;a href="https://lima-vm.io/docs/config/network/user/" rel="noopener noreferrer"&gt;Lima — user-mode network&lt;/a&gt;).&lt;/p&gt;
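&lt;p&gt;With more than two VMs, eyeballing the addresses gets tedious. The helper below is a rough sketch (the function name is hypothetical) that takes "VM name, address" pairs and reports any address claimed by more than one VM, which is the signature of this per-VM NAT isolation.&lt;/p&gt;

```shell
# Read "vm-name address" pairs on stdin and print every address that is
# reported by more than one VM, followed by the VMs that share it.
find_shared_addrs() {
  awk '{ count[$2]++; vms[$2] = vms[$2] " " $1 }
       END { for (a in count) if (count[a] > 1) print a ":" vms[a] }'
}

# Usage, collecting eth0 addresses from each Lima VM:
#   for vm in k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4; do
#     printf '%s %s\n' "$vm" \
#       "$(limactl shell "$vm" -- ip -4 -o addr show eth0 | awk '{print $4}')"
#   done | find_shared_addrs
```

&lt;p&gt;An empty result means every VM has its own address; any line of output means those VMs cannot be told apart by IP and therefore cannot route to each other on that interface.&lt;/p&gt;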

&lt;p&gt;&lt;strong&gt;Clue 2 — with no direct path, even reaching the next VM takes the long-distance road.&lt;/strong&gt; Since the VMs can't reach each other directly, inter-node packets (including Pod VXLAN) take the one common path — the Tailscale (&lt;code&gt;100.x&lt;/code&gt;) that bound both sites in #1 and #2. Reaching the VM right next door is no different. Tailscale is a tool for connecting long distances across NAT, so when a direct (P2P) path isn't possible, it detours through a relay (DERP). Here's what &lt;code&gt;tailscale netcheck&lt;/code&gt; said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tailscale netcheck
&lt;span class="go"&gt;   * MappingVariesByDestIP: true ← NAT where direct P2P is hard (endpoint-dependent mapping)
   * Nearest DERP: Tokyo (30.3ms)

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;MappingVariesByDestIP: true&lt;/code&gt;, a direct connection is hard to establish, so Tailscale falls back to the &lt;strong&gt;nearest DERP relay, Tokyo.&lt;/strong&gt; A packet from VM A in Sapporo to its neighbor VM B left the house, went all the way to Tokyo, and came back to Sapporo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm0u9ljodpq2mnby5kwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm0u9ljodpq2mnby5kwn.png" alt="Same iMac VMs — yet packets went via Tokyo DERP and back" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
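&lt;p&gt;One way to confirm which road a given peer actually takes is &lt;code&gt;tailscale ping&lt;/code&gt;. The sketch below classifies its reply lines; it assumes the usual reply wording ("via DERP(region)" for a relayed path, "via ip:port" for a direct one), and the function name and &lt;code&gt;PEER&lt;/code&gt; are placeholders of mine, so treat the patterns as something to check against your own output.&lt;/p&gt;

```shell
# Classify one reply line from `tailscale ping`.
# Assumed formats (verify against your own output):
#   pong from HOST (100.x.y.z) via DERP(tok) in 87ms        (relayed)
#   pong from HOST (100.x.y.z) via 203.0.113.5:41641 in 4ms (direct)
path_kind() {
  case "$1" in
    *"via DERP("*) echo relayed ;;
    *" via "*)     echo direct ;;
    *)             echo unknown ;;
  esac
}

# Usage (PEER is a placeholder for the neighbor's tailnet name or 100.x address):
#   tailscale ping "$PEER" | while read -r line; do path_kind "$line"; done
```

&lt;p&gt;If every reply comes back as relayed, the peer is only reachable through DERP, which matches what &lt;code&gt;netcheck&lt;/code&gt; hinted at above.&lt;/p&gt;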

&lt;p&gt;&lt;strong&gt;Clue 3 — measure it and that detour is right there.&lt;/strong&gt; Latency between VMs on the same iMac:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lima ↔ lima-2 : 87 ms (max 202, jitter 54)
lima ↔ lima-3 : 133 ms (max 268, jitter 92)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly 90 to 130 ms on average to reach the VM right next door: the cost of a Tokyo round-trip. On top of that, because the home nodes' VXLAN destination (&lt;code&gt;public-ip&lt;/code&gt;) is a tailnet &lt;code&gt;100.x&lt;/code&gt; address, &lt;strong&gt;flannel VXLAN rides on top of Tailscale (WireGuard)&lt;/strong&gt;, piling on §2's double encapsulation and the MTU-1280 squeeze. Since Longhorn's synchronous replication paid this round-trip on every write, 200ms was the obvious result.&lt;/p&gt;
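&lt;p&gt;The MTU squeeze is plain subtraction, sketched below. The 50-byte figure is the standard VXLAN overhead over IPv4 and 1280 is Tailscale's default interface MTU; the function and variable names are only for illustration.&lt;/p&gt;

```shell
# Payload MTU left after one more layer of encapsulation.
# $1 = MTU of the link underneath, $2 = header overhead in bytes
inner_mtu() {
  echo $(( $1 - $2 ))
}

VXLAN_OVERHEAD=50    # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
TAILSCALE_MTU=1280   # default MTU of the tailscale0 interface

inner_mtu 1500 "$VXLAN_OVERHEAD"              # VXLAN over plain Ethernet: 1450
inner_mtu "$TAILSCALE_MTU" "$VXLAN_OVERHEAD"  # VXLAN over Tailscale: 1230
```

&lt;p&gt;So a Pod packet on the doubly encapsulated path has 220 fewer bytes to work with than on a plain VXLAN path, on top of the extra round-trip.&lt;/p&gt;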

&lt;blockquote&gt;
&lt;p&gt;Everything so far is about &lt;strong&gt;VMs inside the same iMac (lima ↔ lima).&lt;/strong&gt; The &lt;strong&gt;cross-site&lt;/strong&gt; link (home ↔ Tokyo) takes a different path (no double encapsulation there), covered separately in §8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nodes being on the same physical host doesn't make them fast.&lt;/strong&gt; When the virtualization network isolates the VMs, even talking to the VM right next door can detour far away through the overlay. Nearby traffic should end nearby — and §6 and §7 find that road.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6. How to fix it
&lt;/h2&gt;

&lt;p&gt;The problem was clear. The four VMs on one iMac &lt;strong&gt;have no LAN that reaches each other directly,&lt;/strong&gt; so even traffic to the VM next door goes through the long-distance Tailscale + DERP. I lined up the candidates for fixing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4ehvkelugek674g3e9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4ehvkelugek674g3e9m.png" alt="Fixing inter-node latency — 1-4 just rewrap on the slow path, 5 (vmnet) changes the path itself" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin flannel to tailscale0 (&lt;code&gt;--flannel-iface=tailscale0&lt;/code&gt;)&lt;/strong&gt; — put VXLAN explicitly on the tailnet. But the root problem (a long-distance detour and double encap despite being on the same host) stays. And the servers use a VPC address; setting only the agents to &lt;code&gt;tailscale0&lt;/code&gt; makes the destinations disagree, and it &lt;strong&gt;breaks asymmetrically — traffic passes one way only&lt;/strong&gt; (you'd have to change both, touching #1's server config too). Latency doesn't drop either.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flannel's &lt;code&gt;host-gw&lt;/code&gt; backend&lt;/strong&gt; — route directly to node IPs with no encapsulation; fastest. But it &lt;strong&gt;assumes direct L2 connectivity between all nodes&lt;/strong&gt; (&lt;a href="https://docs.k3s.io/networking/basic-network-options" rel="noopener noreferrer"&gt;k3s docs&lt;/a&gt;). There's no L2 between Tokyo and Sapporo, and none between the user-mode-isolated VMs either.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flannel's &lt;code&gt;wireguard-native&lt;/code&gt; backend&lt;/strong&gt; — encrypt with WireGuard instead of VXLAN. Good for security, but it still runs over the same detour path, so latency is unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force Tailscale into P2P&lt;/strong&gt; — connect directly instead of via DERP. But what blocks the direct path is &lt;strong&gt;Lima's default user-mode network isolating the VMs,&lt;/strong&gt; so it can't be solved from the Tailscale side alone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the four candidates above have in common is that they &lt;strong&gt;just change the wrapping on a slow path.&lt;/strong&gt; The path itself (leaving the machine and coming back) stays, so latency doesn't drop. So I flipped the idea around.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the fact that they're on the same host: give the VMs a real LAN.&lt;/strong&gt; Since the four live in one iMac, if the host lays a virtual LAN and lets the VMs talk &lt;strong&gt;directly at L2,&lt;/strong&gt; they never go through Tailscale or DERP at all. This is exactly the method Lima's docs recommend for working around the user-mode isolation (&lt;strong&gt;VMNet&lt;/strong&gt;), implemented on macOS via &lt;strong&gt;socket_vmnet.&lt;/strong&gt; ← &lt;strong&gt;adopted.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The test for picking a candidate was one thing: &lt;strong&gt;"does it remove the root cause of the latency (no direct path)?"&lt;/strong&gt; The first four try to "go faster" or "re-wrap with a different VPN" and leave the path alone, but vmnet &lt;strong&gt;changes the path itself into a LAN inside the house.&lt;/strong&gt; Then Tailscale handles only the long-distance Tokyo ↔ Sapporo leg, and same-host traffic ends at home. The next section applies it and checks the result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Laying a LAN inside the house with socket_vmnet
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7-1. vmnet and socket_vmnet
&lt;/h3&gt;

&lt;p&gt;The problem from §5 was "the VMs on one iMac have no LAN that reaches each other directly." What fills that gap is &lt;strong&gt;vmnet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vmnet is macOS's built-in virtual-networking framework&lt;/strong&gt; (&lt;code&gt;vmnet.framework&lt;/code&gt;) — Apple's official API that builds NAT, bridges, and host networking for VMs. But using it directly requires the VM process to hold root privileges and an entitlement, which is a hassle. &lt;strong&gt;socket_vmnet&lt;/strong&gt; is a small daemon built by the Lima project that wraps this &lt;code&gt;vmnet.framework&lt;/code&gt; and exposes it &lt;strong&gt;over a Unix socket.&lt;/strong&gt; Only the socket_vmnet daemon runs as root; the VMs just connect to that socket — &lt;strong&gt;the VMs themselves don't need to run as root.&lt;/strong&gt; (That's why the install in 7-2 puts the binary in a root-owned &lt;code&gt;/opt&lt;/code&gt; and lays down a sudoers entry.) (&lt;a href="https://github.com/lima-vm/socket_vmnet" rel="noopener noreferrer"&gt;socket_vmnet&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;socket_vmnet offers three modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;shared&lt;/strong&gt; — a private subnet (&lt;code&gt;192.168.105.0/24&lt;/code&gt;) + internet NAT. The &lt;strong&gt;VMs connected to the same socket_vmnet sit on one virtual switch (L2) and talk to each other directly.&lt;/strong&gt; ← what we need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bridged&lt;/strong&gt; — joins the VMs straight onto the host's physical LAN (e.g. &lt;code&gt;en0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;host&lt;/strong&gt; — an isolated network with no internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(&lt;a href="https://lima-vm.io/docs/config/network/vmnet/" rel="noopener noreferrer"&gt;Lima — VMNet&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;How the structure changes is the crux.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt; — each VM had only &lt;code&gt;eth0&lt;/code&gt; (Lima's default user-mode, &lt;code&gt;192.168.5.15&lt;/code&gt;, mutually isolated) and &lt;code&gt;tailscale0&lt;/code&gt; (&lt;code&gt;100.x&lt;/code&gt;). With no direct path between VMs, inter-node traffic (including Pod VXLAN) leaked onto &lt;code&gt;tailscale0&lt;/code&gt;, causing §5's long-distance detour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt; — each VM gets &lt;strong&gt;one more interface, &lt;code&gt;lima0&lt;/code&gt;&lt;/strong&gt; (vmnet shared, &lt;code&gt;192.168.105.x&lt;/code&gt;). &lt;code&gt;eth0&lt;/code&gt; and &lt;code&gt;tailscale0&lt;/code&gt; stay as they were, and &lt;strong&gt;the k3s node's InternalIP is still the Tailscale &lt;code&gt;100.x&lt;/code&gt;,&lt;/strong&gt; so the cluster's identity and membership don't change. Only the data path changes — &lt;code&gt;lima0&lt;/code&gt; becomes the node's default route, and per the rule from §2 ("flannel picks the default-route interface as its VXLAN destination / public-ip"), &lt;strong&gt;flannel moves its VXLAN destination to &lt;code&gt;lima0&lt;/code&gt;.&lt;/strong&gt; So Pod traffic between home nodes finishes over vmnet without going through the tailnet or DERP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, it's an &lt;strong&gt;additive change&lt;/strong&gt; — without rebuilding the cluster or touching the nodes' identity, you add one interface (&lt;code&gt;lima0&lt;/code&gt;) and reroute only same-host traffic onto a fast LAN.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59sajzxx8k1m3w30an4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59sajzxx8k1m3w30an4l.png" alt="What vmnet changes — adding one lima0 interface" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;
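&lt;p&gt;Since the whole change hinges on which interface holds the default route, it helps to check that directly. The sketch below reads &lt;code&gt;ip -4 route show default&lt;/code&gt; output and prints the interface of the lowest-metric default route, which is the one Linux prefers. It does not reproduce flannel's own selection code; it only mirrors the rule quoted from §2, and the function name is hypothetical.&lt;/p&gt;

```shell
# Read `ip -4 route show default` output on stdin and print the interface of
# the default route with the lowest metric (a missing metric counts as 0).
default_route_dev() {
  awk '$1 == "default" {
         dev = ""; metric = 0
         for (i = NF; i > 1; i--) {
           if ($(i - 1) == "dev")    dev = $i
           if ($(i - 1) == "metric") metric = $i + 0
         }
         if (best == "" || bestmetric > metric) { best = dev; bestmetric = metric }
       }
       END { if (best != "") print best }'
}

# Usage inside a VM, or from the host through limactl:
#   limactl shell k3s-agent -- ip -4 route show default | default_route_dev
```

&lt;p&gt;If this prints &lt;code&gt;lima0&lt;/code&gt; after the change, the precondition for flannel moving its VXLAN destination is in place.&lt;/p&gt;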

&lt;h3&gt;
  
  
  7-2. Install socket_vmnet (host = the iMac)
&lt;/h3&gt;

&lt;p&gt;Because socket_vmnet is a daemon that runs as root, the binary must live on a &lt;strong&gt;root-owned path that a user can't tamper with.&lt;/strong&gt; Lima discourages installing it via Homebrew for security reasons, so build it from source into &lt;code&gt;/opt/socket_vmnet&lt;/code&gt;. (&lt;a href="https://lima-vm.io/docs/config/network/vmnet/" rel="noopener noreferrer"&gt;Lima — VMNet&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;git clone https://github.com/lima-vm/socket_vmnet
cd socket_vmnet
&lt;/span&gt;&lt;span class="gp"&gt;git checkout v1.2.2 #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;check the latest stable tag on the releases page
&lt;span class="go"&gt;make
sudo make PREFIX=/opt/socket_vmnet install.bin
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;→ /opt/socket_vmnet/bin/socket_vmnet &lt;span class="o"&gt;(&lt;/span&gt;root-owned&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
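&lt;p&gt;After the install it is worth confirming that the binary really is root-owned. A minimal, portable check could look like this (the function name is mine; &lt;code&gt;ls -ldn&lt;/code&gt; prints the numeric owner in the third column on both macOS and Linux):&lt;/p&gt;

```shell
# Print the numeric owner UID of a path (0 means root).
owner_uid() {
  ls -ldn "$1" | awk '{ print $3 }'
}

# Usage on the iMac:
#   [ "$(owner_uid /opt/socket_vmnet/bin/socket_vmnet)" = "0" ] || echo "not root-owned"
```

&lt;p&gt;The same check can be repeated on &lt;code&gt;/opt/socket_vmnet&lt;/code&gt; and its &lt;code&gt;bin&lt;/code&gt; directory, since a user-writable parent directory would defeat the point.&lt;/p&gt;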



&lt;h3&gt;
  
  
  7-3. Register the Lima sudoers entry
&lt;/h3&gt;

&lt;p&gt;Install a sudoers fragment so that Lima can launch socket_vmnet as root.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;limactl sudoers &amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;etc_sudoers.d_lima
&lt;span class="gp"&gt;less etc_sudoers.d_lima #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;review the contents first
&lt;span class="go"&gt;sudo install -o root etc_sudoers.d_lima /etc/sudoers.d/lima
rm etc_sudoers.d_lima

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7-4. Attach the shared network to each VM
&lt;/h3&gt;

&lt;p&gt;Add the shared network to each VM's &lt;code&gt;~/.lima/&amp;lt;vm&amp;gt;/lima.yaml&lt;/code&gt;. This network lays down &lt;code&gt;192.168.105.0/24&lt;/code&gt; (gateway &lt;code&gt;192.168.105.1&lt;/code&gt;) and gives each VM an address in that range via a &lt;code&gt;lima0&lt;/code&gt; interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;lima&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared&lt;/span&gt;
    &lt;span class="na"&gt;interface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lima0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note: duplicate &lt;code&gt;networks:&lt;/code&gt; key.&lt;/strong&gt; If an active &lt;code&gt;networks:&lt;/code&gt; key already exists in lima.yaml (even an empty one), appending another at the end causes a YAML duplicate-key parse error. Merge the entry under the existing key instead. A commented-out &lt;code&gt;# networks:&lt;/code&gt; line doesn't count, because YAML ignores comments.&lt;/p&gt;
&lt;/blockquote&gt;
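&lt;p&gt;A quick pre-flight check before restarting a VM, as a sketch with a hypothetical function name: count the active top-level &lt;code&gt;networks:&lt;/code&gt; keys in the file. Anything other than 1 after the edit deserves a second look.&lt;/p&gt;

```shell
# Count active, unindented `networks:` keys in a lima.yaml.
# $1 = path to the file
count_networks_keys() {
  grep -c '^networks:' "$1" || true   # grep exits 1 on zero matches; still prints 0
}

# Usage:
#   count_networks_keys ~/.lima/k3s-agent/lima.yaml
```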

&lt;h3&gt;
  
  
  7-5. Restart one at a time
&lt;/h3&gt;

&lt;p&gt;Restart one VM at a time, confirming each goes &lt;code&gt;Ready&lt;/code&gt; again (to minimize cluster impact).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;VM &lt;span class="k"&gt;in &lt;/span&gt;k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;limactl stop &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  limactl start &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; k3s-lightsail &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Ready node/lima-&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;180s
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On restart, k3s comes back up and flannel re-picks its interface. Since the default route is now &lt;code&gt;lima0&lt;/code&gt; (192.168.105.1), flannel &lt;strong&gt;re-advertises its VXLAN destination (&lt;code&gt;public-ip&lt;/code&gt;) as the vmnet address&lt;/strong&gt; per the rule from §2 / §7-1 — with no extra flag like &lt;code&gt;--flannel-iface&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7-6. Verify
&lt;/h3&gt;

&lt;p&gt;Check that each VM got a vmnet address on &lt;code&gt;lima0&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;limactl shell k3s-agent &lt;span class="nt"&gt;--&lt;/span&gt; ip &lt;span class="nt"&gt;-4&lt;/span&gt; addr show lima0 | &lt;span class="nb"&gt;grep &lt;/span&gt;inet
&lt;span class="go"&gt;    inet 192.168.105.2/24 ... lima0

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then re-measure ping between VMs on the same iMac:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;limactl shell k3s-agent &lt;span class="nt"&gt;--&lt;/span&gt; ping &lt;span class="nt"&gt;-c5&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; 192.168.105.3
&lt;span class="go"&gt;5 packets transmitted, 5 received, 0% packet loss
rtt min/avg/max/mdev = 0.449/0.571/0.639/0.064 ms

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same-host VM-to-VM latency that was &lt;code&gt;87 to 133 ms&lt;/code&gt; dropped to the &lt;strong&gt;0.5 ms range.&lt;/strong&gt; The packets that had been round-tripping to Tokyo now finish inside the iMac.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvcpiehmrmnii4jdgrus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvcpiehmrmnii4jdgrus.png" alt="After socket_vmnet — from an 87-133ms Tokyo detour to a 0.5ms direct lima0 path" width="799" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;By default, flannel picks "the default-route interface" as its VXLAN destination.&lt;/strong&gt; So if you give a node a faster direct path (here, vmnet's &lt;code&gt;lima0&lt;/code&gt;) and make it the default route, flannel switches over to it on its own — with no extra CNI config.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  8. Result — what vmnet changed, and the truth about 8472
&lt;/h2&gt;

&lt;p&gt;After applying it, the VXLAN destination (&lt;code&gt;public-ip&lt;/code&gt;) that flannel advertises splits cleanly into the two sites:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; k3s-lightsail get nodes &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    -o custom-columns='NAME:.metadata.name,PUBLIC-IP:.metadata.annotations.flannel\.alpha\.coreos\.com/public-ip'
NAME PUBLIC-IP
&lt;/span&gt;&lt;span class="gp"&gt;ip-172-26-2-70 172.26.2.70 #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Tokyo &lt;span class="o"&gt;(&lt;/span&gt;server&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; VPC
&lt;span class="go"&gt;ip-172-26-3-146 172.26.3.146
&lt;/span&gt;&lt;span class="gp"&gt;lima-k3s-agent 192.168.105.2 #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;home &lt;span class="o"&gt;(&lt;/span&gt;lima&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; vmnet
&lt;span class="go"&gt;lima-k3s-agent-2 192.168.105.3
lima-k3s-agent-3 192.168.105.4
lima-k3s-agent-4 192.168.105.5

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four home nodes now exchange VXLAN directly over each other's vmnet address (&lt;code&gt;192.168.105.x&lt;/code&gt;) — 0.5 ms.&lt;/p&gt;
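&lt;p&gt;To make that split easy to re-check later, the advertised addresses can be labeled by the network they belong to. The sketch below uses the ranges from this cluster; the function name is hypothetical, and the &lt;code&gt;100.*&lt;/code&gt; pattern is a loose stand-in for the tailnet range.&lt;/p&gt;

```shell
# Label a flannel public-ip by the underlay it belongs to in this cluster.
underlay_of() {
  case "$1" in
    192.168.105.*) echo vmnet ;;    # socket_vmnet shared network
    172.26.*)      echo vpc ;;      # Tokyo VPC
    100.*)         echo tailnet ;;  # loose match for Tailscale 100.x addresses
    *)             echo other ;;
  esac
}

# Usage:
#   kubectl --context k3s-lightsail get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,IP:.metadata.annotations.flannel\.alpha\.coreos\.com/public-ip' |
#     while read -r name ip; do printf '%s %s\n' "$name" "$(underlay_of "$ip")"; done
```

&lt;p&gt;A home node that shows up as tailnet again would mean its VXLAN has slipped back onto the long road.&lt;/p&gt;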

&lt;h3&gt;
  
  
  The 8472 I opened in §3 — I went to close it and looked at the actual route
&lt;/h3&gt;

&lt;p&gt;With the home nodes sorted out by vmnet, it was time to close the 8472 I'd opened against principle in §3. But the moment I went to close it, I got nervous — &lt;strong&gt;what if closing it breaks inter-node Pod traffic again?&lt;/strong&gt; Because in §3 I'd thought "opening it made it work."&lt;/p&gt;

&lt;p&gt;So before closing, instead of guessing I &lt;strong&gt;looked at the actual route.&lt;/strong&gt; All you have to do is query the route from one node to a Pod on a node at the other site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;from a home &lt;span class="o"&gt;(&lt;/span&gt;lima&lt;span class="o"&gt;)&lt;/span&gt; node to a Pod on a Tokyo &lt;span class="o"&gt;(&lt;/span&gt;server&lt;span class="o"&gt;)&lt;/span&gt; node
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ip route get 10.42.0.235
&lt;span class="go"&gt;10.42.0.235 dev tailscale0 table 52 src 100.84.x.x ...

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dev tailscale0&lt;/code&gt;, not &lt;code&gt;dev flannel.1&lt;/code&gt;. In other words, &lt;strong&gt;cross-site Pod traffic was flowing directly over Tailscale, not flannel VXLAN (8472).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the mechanism. In this cluster, each node &lt;strong&gt;advertises its own pod CIDR (&lt;code&gt;10.42.N.0/24&lt;/code&gt;) as a Tailscale subnet route&lt;/strong&gt; and accepts the others' (&lt;code&gt;accept-routes&lt;/code&gt;). So a remote node's Pod range bypasses flannel and travels directly over the encrypted Tailscale (WireGuard). This is also the official way k3s binds a distributed / multi-cloud cluster with Tailscale (&lt;a href="https://docs.k3s.io/networking/distributed-multicloud" rel="noopener noreferrer"&gt;k3s — Distributed/multicloud&lt;/a&gt;, &lt;a href="https://tailscale.com/kb/1019/subnets" rel="noopener noreferrer"&gt;Tailscale — Subnet routers&lt;/a&gt;). So the "double encapsulation, wrapping VXLAN over WireGuard" from §2 / §5 was the old path of the same host (lima ↔ lima); &lt;strong&gt;cross-site is a single layer of WireGuard.&lt;/strong&gt;&lt;/p&gt;
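&lt;p&gt;The same route lookup can be scripted for any Pod IP. This is a sketch with hypothetical function names: it pulls the &lt;code&gt;dev&lt;/code&gt; field out of &lt;code&gt;ip route get&lt;/code&gt; output and names the carrier, based only on the two interfaces seen in this cluster.&lt;/p&gt;

```shell
# Read `ip route get ADDR` output on stdin and print the word after "dev".
route_dev() {
  awk '{ for (i = NF; i > 1; i--) if ($(i - 1) == "dev") { print $i; exit } }'
}

# Name what carries the traffic for a given outgoing interface.
carrier_of() {
  case "$1" in
    flannel.1)  echo "flannel VXLAN (UDP 8472)" ;;
    tailscale0) echo "Tailscale (WireGuard)" ;;
    *)          echo "other: $1" ;;
  esac
}

# Usage on a node, for a Pod IP at the other site:
#   carrier_of "$(ip route get 10.42.0.235 | route_dev)"
```

&lt;p&gt;Running it against one Pod IP per remote node shows at a glance which legs still depend on 8472.&lt;/p&gt;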

&lt;p&gt;Having confirmed the route, I closed the public 8472, restricting its source range from public (&lt;code&gt;0.0.0.0/0&lt;/code&gt;) to VPC-private (&lt;code&gt;172.26.0.0/16&lt;/code&gt;). Then I re-measured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cross-site (home → Tokyo Pod) : 20~26 ms (Tailscale, unchanged)
home nodes (lima ↔ lima) : 0.4 ms (vmnet, unchanged)
servers (server ↔ server) : 0.3 ms (VPC VXLAN, unchanged)
6 nodes Ready : 6/6

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing broke. &lt;strong&gt;The public 8472 wasn't the actual route of inter-node traffic — it was just a leftover.&lt;/strong&gt; 8472 is still in use, but only inside private networks — lima ↔ lima over vmnet, server ↔ server over VPC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrttj0l9z9y6j0s2obl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrttj0l9z9y6j0s2obl1.png" alt="Cross-site was flowing over Tailscale — the public 8472 was a leftover" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why bother closing it? &lt;strong&gt;VXLAN is a protocol with no authentication and no encryption&lt;/strong&gt; (RFC 7348's security considerations leave that to external mechanisms such as IPsec). Its only identifier is the VNI, and flannel's default is &lt;code&gt;1&lt;/code&gt;, so anyone who can reach 8472 on the public net can inject packets into the Pod overlay (&lt;code&gt;10.42.0.0/16&lt;/code&gt;). 6443 at least has a gate of certificates and tokens; 8472 doesn't even have that. So while I was at it, I also closed the apiserver (6443) and kubelet (10250) to the public net, leaving them reachable from the VPC and tailnet only, and restricted SSH (22) to the tailnet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"It worked after a change" doesn't mean "the change is why it worked."&lt;/strong&gt; Opening 8472 and traffic going through was a fact, but what actually carried the traffic was Tailscale — the public 8472 was a leftover from the start. Before you open and close ports on a guess, &lt;strong&gt;read the actual route once with &lt;code&gt;ip route get&lt;/code&gt;&lt;/strong&gt; — that's the cheapest way to cut costly exposure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A remaining limit — cross-site is far, by the distance
&lt;/h3&gt;

&lt;p&gt;What vmnet fixed is &lt;strong&gt;inside the same host (between home nodes).&lt;/strong&gt; The Tokyo-cloud ↔ Sapporo-home leg is physically far apart, so it stays on Tailscale, and its latency (tens of ms) is set by distance and can't be tuned away.&lt;/p&gt;

&lt;p&gt;So I &lt;strong&gt;compensate with placement&lt;/strong&gt; — keep workloads with a lot of inter-node synchronous traffic (latency-sensitive ones) &lt;strong&gt;gathered within the home node group.&lt;/strong&gt; I use the &lt;code&gt;node-type=lima&lt;/code&gt; label I'd put on the home nodes back in #2, via a &lt;code&gt;nodeSelector&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; k3s-lightsail get nodes &lt;span class="nt"&gt;-L&lt;/span&gt; node-type
&lt;span class="go"&gt;NAME ... NODE-TYPE
ip-172-26-2-70 ... lightsail
ip-172-26-3-146 ... lightsail
lima-k3s-agent ... lima
lima-k3s-agent-2 ... lima
lima-k3s-agent-3 ... lima
lima-k3s-agent-4 ... lima

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a workload's manifest, you write it like this (if the label isn't there, set it first with &lt;code&gt;kubectl label node lima-k3s-agent node-type=lima&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lima&lt;/span&gt; &lt;span class="c1"&gt;# this Pod group only on home nodes (vmnet, 0.5ms)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
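&lt;p&gt;To confirm the selector did what was intended, the Pods' actual nodes can be checked after rollout. The filter below is a sketch: the function name and the &lt;code&gt;app=my-db&lt;/code&gt; label are placeholders, and it relies on the home nodes' names starting with &lt;code&gt;lima-&lt;/code&gt; as in the listing above.&lt;/p&gt;

```shell
# Read "pod node" pairs on stdin and print the Pods that are not on a home
# node (a node whose name starts with "lima-").
pods_off_home_nodes() {
  awk '$2 !~ /^lima-/ { print $1 }'
}

# Usage (app=my-db is a placeholder label):
#   kubectl --context k3s-lightsail get pods -l app=my-db --no-headers \
#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' | pods_off_home_nodes
```

&lt;p&gt;No output means every replica sits on the 0.5 ms side; Pods that are still unscheduled also show up here, since they have no node name yet.&lt;/p&gt;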



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vo0yhx8fmtweksvmgy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vo0yhx8fmtweksvmgy1.png" alt="Final picture — vmnet for what's near, Tailscale for what's far, placement via nodeSelector" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Even within one cluster, inter-node latency isn't uniform.&lt;/strong&gt; Accept the fast leg (same host = vmnet) and the slow leg (long distance = Tailscale), and place workloads by latency with a &lt;code&gt;nodeSelector&lt;/code&gt; to steer around the slow leg.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  9. Glossary — what came up this time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;flannel / VXLAN&lt;/strong&gt; — k3s's default CNI is flannel, its default backend VXLAN. It carries cross-node Pod packets encapsulated in UDP (8472).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VTEP / &lt;code&gt;flannel.1&lt;/code&gt;&lt;/strong&gt; — the endpoint that wraps and unwraps VXLAN capsules. Exists per node as the &lt;code&gt;flannel.1&lt;/code&gt; interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flannel's &lt;code&gt;public-ip&lt;/code&gt;&lt;/strong&gt; — the destination each node advertises as "this is where I receive my VXLAN." Defaults to the node's default-route interface IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MTU&lt;/strong&gt; — the maximum size of a single packet. It shrinks with every layer of encapsulation (Ethernet 1500 → VXLAN 1450 → 1230 when VXLAN rides over Tailscale).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;double encapsulation&lt;/strong&gt; — wrapping a VXLAN packet again in WireGuard. Overhead, MTU squeeze, and latency pile up. (In this cluster it only happened for lima ↔ lima before vmnet; not for cross-site.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailscale / DERP&lt;/strong&gt; — a WireGuard-based mesh VPN. When a direct (P2P) path isn't possible, it detours through a DERP relay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailscale subnet route&lt;/strong&gt; — a node advertises a range (here, its own pod CIDR) with &lt;code&gt;--advertise-routes&lt;/code&gt;, another node receives it with &lt;code&gt;--accept-routes&lt;/code&gt;, and that range's traffic travels over the tailnet (WireGuard). Cross-site Pod traffic flows over this road.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT (endpoint-dependent mapping)&lt;/strong&gt; — a NAT whose port mapping varies by destination. Direct P2P through it is hard, so Tailscale falls back to DERP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vmnet (&lt;code&gt;vmnet.framework&lt;/code&gt;)&lt;/strong&gt; — Apple's framework that provides NAT, bridges, and host networking to VMs on macOS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;socket_vmnet&lt;/strong&gt; — a root daemon that exposes &lt;code&gt;vmnet.framework&lt;/code&gt; over a Unix socket. VMs connect to the socket without root, get a shared LAN (&lt;code&gt;192.168.105.0/24&lt;/code&gt;), and talk directly at L2 between VMs on the same host.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lima user-mode network&lt;/strong&gt; — Lima's default network (fixed at &lt;code&gt;192.168.5.0/24&lt;/code&gt;). Isolates VMs from each other and from the host.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;node-type&lt;/code&gt; label / nodeSelector&lt;/strong&gt; — a label on the nodes (here, lima/lightsail). Used in a &lt;code&gt;nodeSelector&lt;/code&gt; to place workloads on a particular group of nodes.&lt;/li&gt;
&lt;/ul&gt;
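
&lt;p&gt;The MTU chain in the glossary is plain subtraction. A minimal sketch, assuming flannel's usual 50-byte VXLAN overhead and Tailscale's default interface MTU of 1280 (neither figure is stated above, so confirm both with &lt;code&gt;ip link&lt;/code&gt; on your own nodes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Assumed values: 50-byte VXLAN overhead, Tailscale interface MTU 1280.
vxlan_overhead=50
ethernet_mtu=1500
tailscale_mtu=1280

echo "VXLAN over Ethernet:  $((ethernet_mtu - vxlan_overhead))"   # 1450
echo "VXLAN over Tailscale: $((tailscale_mtu - vxlan_overhead))"  # 1230
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;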

&lt;h2&gt;
  
  
  10. Next
&lt;/h2&gt;

&lt;p&gt;With inter-node traffic sorted out, I can finally run all sorts of services stably on these six nodes.&lt;/p&gt;

&lt;p&gt;In the end, the important problem that Longhorn surfaced was solved cleanly, and the cluster is still running stably today. That said, given the nodes' spec limits, I decided to give up on Longhorn for now. Putting it back and testing it is homework for when I have a roomier environment.&lt;/p&gt;

&lt;p&gt;The setup described so far lets me run a fair range of services. Optimization and other small issues I've been handling as they come up in operation, and once enough material piles up I'd like to gather and organize that too.&lt;/p&gt;

&lt;p&gt;Next time I plan to talk about &lt;strong&gt;CloudNativePG (CNPG).&lt;/strong&gt; Right as the inter-node networking got solved, I set up CNPG to practice and verify running a service with internal clustering — and it's now serving as the main DB for quite a few services.&lt;/p&gt;

&lt;p&gt;Thanks for reading all the way through.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;k3s — Basic Network Options / Requirements / Distributed·multicloud (Tailscale integration): &lt;a href="https://docs.k3s.io/networking/basic-network-options" rel="noopener noreferrer"&gt;docs.k3s.io/networking/basic-network-options&lt;/a&gt; · &lt;a href="https://docs.k3s.io/installation/requirements" rel="noopener noreferrer"&gt;/installation/requirements&lt;/a&gt; · &lt;a href="https://docs.k3s.io/networking/distributed-multicloud" rel="noopener noreferrer"&gt;/networking/distributed-multicloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;flannel — Backends (VXLAN · 8472 · MTU): &lt;a href="https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md" rel="noopener noreferrer"&gt;github.com/flannel-io/flannel … backends.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Longhorn — What is Longhorn / Best Practices / KB (volume readonly or I/O error): &lt;a href="https://longhorn.io/docs/1.11.2/what-is-longhorn/" rel="noopener noreferrer"&gt;longhorn.io/docs/1.11.2/what-is-longhorn&lt;/a&gt; · &lt;a href="https://longhorn.io/docs/1.11.2/best-practices/" rel="noopener noreferrer"&gt;/best-practices&lt;/a&gt; · &lt;a href="https://longhorn.io/kb/troubleshooting-volume-readonly-or-io-error/" rel="noopener noreferrer"&gt;/kb/troubleshooting-volume-readonly-or-io-error&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tailscale — Device connectivity / How NAT traversal works / Subnet routers: &lt;a href="https://tailscale.com/kb/1411/device-connectivity" rel="noopener noreferrer"&gt;tailscale.com/kb/1411/device-connectivity&lt;/a&gt; · &lt;a href="https://tailscale.com/blog/how-nat-traversal-works" rel="noopener noreferrer"&gt;tailscale.com/blog/how-nat-traversal-works&lt;/a&gt; · &lt;a href="https://tailscale.com/kb/1019/subnets" rel="noopener noreferrer"&gt;tailscale.com/kb/1019/subnets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lima / socket_vmnet — User-mode network / VMNet: &lt;a href="https://lima-vm.io/docs/config/network/user/" rel="noopener noreferrer"&gt;lima-vm.io/docs/config/network/user&lt;/a&gt; · &lt;a href="https://lima-vm.io/docs/config/network/vmnet/" rel="noopener noreferrer"&gt;/vmnet&lt;/a&gt; · &lt;a href="https://github.com/lima-vm/socket_vmnet" rel="noopener noreferrer"&gt;github.com/lima-vm/socket_vmnet&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;VXLAN security (no auth/encryption) — &lt;a href="https://www.rfc-editor.org/rfc/rfc7348#section-6" rel="noopener noreferrer"&gt;RFC 7348 §6 Security Considerations&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Hybrid k3s #2: Welcoming the sleeping iMac as a teammate (4 Lima VM agents)</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/seon/hybrid-k3s-2-welcoming-the-sleeping-imac-as-a-teammate-4-lima-vm-agents-3011</link>
      <guid>https://dev.to/seon/hybrid-k3s-2-welcoming-the-sleeping-imac-as-a-teammate-4-lima-vm-agents-3011</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-2%2Fk3s-2-master-en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-2%2Fk3s-2-master-en.png" alt="Hybrid k3s — full architecture" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  0. About this series
&lt;/h2&gt;

&lt;p&gt;This series is a record — written one piece at a time — of how I actually built the homelab in the diagram above, the one that's still running as I write this.&lt;/p&gt;

&lt;p&gt;What began as a toy project sparked by a simple "could this even work?" turned, through satisfying performance and endless tearing down and rebuilding, into a hobby that genuinely takes the edge off the stress built up at work. It isn't a resource-rich cluster, but it's been more than enough to get a real taste of Kubernetes, and it keeps handing me the next thing I want to try.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 nodes&lt;/strong&gt; — 2 Lightsail &lt;strong&gt;servers&lt;/strong&gt; (control plane + etcd) in the cloud (AWS Tokyo) + 4 &lt;strong&gt;Lima VM agents&lt;/strong&gt; on a home (Sapporo) iMac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 vCPU / 61 GiB&lt;/strong&gt; total, &lt;strong&gt;49 namespaces&lt;/strong&gt;, &lt;strong&gt;248 pods (150 running)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed with &lt;strong&gt;ArgoCD&lt;/strong&gt;, auth via &lt;strong&gt;Keycloak OIDC&lt;/strong&gt;, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana and more running on top&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This article is about taking the &lt;strong&gt;two-node cloud cluster from #1 and welcoming the iMac that was gathering dust at home — split into 4 Lima VMs that join as agents.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Background — welcoming the sleeping iMac as a teammate
&lt;/h2&gt;

&lt;p&gt;What I built in #1 was the cloud-side foundation.&lt;/p&gt;

&lt;p&gt;I put k3s on two AWS Lightsail instances (8GB and 16GB), formed a two-node &lt;strong&gt;control plane + embedded etcd&lt;/strong&gt;, and bound both nodes over Tailscale so they could call each other by a &lt;code&gt;100.x&lt;/code&gt; address. Instead of the textbook three nodes I went with two, taking out insurance with automatic etcd snapshots. The result was a cluster that was, in effect, just the "head." (Plenty of apps are crammed onto those nodes too, penny-pincher that I am.)&lt;/p&gt;

&lt;p&gt;This time I'm adding the "limbs."&lt;/p&gt;

&lt;p&gt;At home, a fairly old &lt;strong&gt;64GB-RAM iMac&lt;/strong&gt; sits idle. It's slow — it has an HDD — but memory is the one thing it has plenty of, and its macOS is new enough to run virtualization (vz), so as a host it's more than enough. The goal this time is to bring it in as a cluster worker.&lt;/p&gt;

&lt;p&gt;But there was one thing to decide right at the start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bring the iMac in whole as a single node, or split it into several?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The easy path is whole.&lt;/p&gt;

&lt;p&gt;Install Ubuntu on the iMac, stand up one k3s agent, and you're done. But the whole point of this homelab is "to handle Kubernetes like the real thing." With that in mind, I weighed the two options against the official guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limit of going whole (one node).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes docs recommend &lt;a href="https://kubernetes.io/docs/setup/best-practices/cluster-large/" rel="noopener noreferrer"&gt;at least one instance per failure zone&lt;/a&gt; for fault tolerance. If the home side is a single node, that node is itself a single point of failure — and, more to the point, &lt;strong&gt;none of the practice that assumes multiple nodes is possible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cordon&lt;/code&gt;/&lt;code&gt;drain&lt;/code&gt; a node to shift its workloads off, spread Pods across nodes (anti-affinity), roll a node out and back in: with only one node, none of these exercises means anything.&lt;/p&gt;
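
&lt;p&gt;With several nodes, that practice loop looks roughly like the following. This is a sketch only: the node name &lt;code&gt;lima-k3s-agent-2&lt;/code&gt; is borrowed from §4 as an example, and the two drain flags are the ones usually needed when DaemonSets and emptyDir volumes are present.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Example node name; substitute one from "kubectl get nodes".
node=lima-k3s-agent-2

# Stop scheduling new Pods onto the node.
kubectl cordon "$node"

# Evict what is running there (DaemonSet Pods stay, emptyDir data is dropped).
kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data

# Confirm the evicted Pods were rescheduled onto the other nodes.
kubectl get pods -A -o wide

# Let the node take work again.
kubectl uncordon "$node"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a single-node cluster the drain step has nowhere to move the Pods, which is the limit described above.&lt;/p&gt;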

&lt;p&gt;&lt;strong&gt;The worth of splitting (multiple nodes).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add several nodes and the blast radius shrinks, while spreading becomes possible. &lt;a href="https://learnkube.com/kubernetes-node-size" rel="noopener noreferrer"&gt;learnkube's worker-node sizing analysis&lt;/a&gt; shows this in numbers — with five nodes you can scatter five replicas onto separate nodes, so &lt;strong&gt;losing one node costs you at most one replica.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With only two nodes, no matter how many replicas you add, the effective spread tops out at two.&lt;/p&gt;

&lt;p&gt;Splitting isn't free, of course.&lt;/p&gt;

&lt;p&gt;As that same article points out, every node reserves resources for kubelet and the OS — a 1 vCPU/4GB node gives up about 1.1GB, a 4 vCPU/32GB node about 3.66GB — so &lt;strong&gt;the finer you slice, the larger the system-overhead ratio.&lt;/strong&gt; The pods-per-node count is also &lt;a href="https://kubernetes.io/docs/setup/best-practices/cluster-large/" rel="noopener noreferrer"&gt;capped at 110 by default&lt;/a&gt;. In short, "infinitely fine" isn't the answer; you want a balance point of &lt;strong&gt;reasonably sized nodes in a reasonable number.&lt;/strong&gt;&lt;/p&gt;
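
&lt;p&gt;To see how the reserved share shrinks as nodes grow, the two data points quoted above can be turned into percentages. A quick sketch using &lt;code&gt;awk&lt;/code&gt; for the floating-point math; the figures are the ones cited from the article, not measurements from this cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# reserved GB / total GB for the two node sizes quoted above
awk 'BEGIN {
  printf "1 vCPU / 4GB:  %.1f%% reserved\n", 1.1  / 4  * 100
  printf "4 vCPU / 32GB: %.1f%% reserved\n", 3.66 / 32 * 100
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That works out to roughly 28% versus 11%: the small node loses more than twice the share of its memory.&lt;/p&gt;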

&lt;p&gt;The conclusion was to split.&lt;/p&gt;

&lt;p&gt;For a learning-focused homelab where you want to handle scheduling and failure "like a real cluster," the home side should be multiple nodes too. That 64GB of memory is what makes the luxury possible. (How many to split into is decided in §3.)&lt;/p&gt;

&lt;p&gt;And here another question branches off. &lt;strong&gt;How do you split one physical machine into multiple nodes?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Lima VM
&lt;/h2&gt;

&lt;p&gt;There are several ways to turn one physical iMac into multiple k3s nodes. I lined up the candidates and filtered them against this homelab's conditions ("run headless as a server around the clock, mass-produce identical machines reproducibly, keep macOS").&lt;/p&gt;

&lt;p&gt;The biggest fork is &lt;strong&gt;fake the nodes with containers, or make real nodes with VMs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vbk6kzub9ptorn7doey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vbk6kzub9ptorn7doey.png" alt="One iMac into multiple nodes: 5 ways compared" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;For this situation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bare-metal Ubuntu reinstall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Install Linux straight on the iMac, one node&lt;/td&gt;
&lt;td&gt;Have to wipe macOS, and you still end up with &lt;strong&gt;one node&lt;/strong&gt; → conflicts with the point of splitting, excluded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://k3d.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Docker + k3d&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Stand up k3s in containers to mimic multi-node&lt;/td&gt;
&lt;td&gt;Nodes are containers, so they &lt;strong&gt;share the host kernel&lt;/strong&gt; (weak isolation); the "real node" feel is thin → underwhelming for learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://multipass.run/" rel="noopener noreferrer"&gt;&lt;strong&gt;Multipass&lt;/strong&gt;&lt;/a&gt; (Canonical)&lt;/td&gt;
&lt;td&gt;Launch an Ubuntu VM in one line&lt;/td&gt;
&lt;td&gt;VM = real node, fine, but &lt;strong&gt;Ubuntu only&lt;/strong&gt;, and it leans toward one-shot launches rather than declaratively mass-producing identical VMs (the backend has been &lt;a href="https://documentation.ubuntu.com/multipass/latest/how-to-guides/customise-multipass/migrate-from-hyperkit-to-qemu-on-macos/" rel="noopener noreferrer"&gt;QEMU by default since 1.12&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.virtualbox.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;VirtualBox&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;/ VMware Fusion&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Traditional GUI hypervisors&lt;/td&gt;
&lt;td&gt;Run on Intel Macs but heavy and GUI-centric; scripting N of them is a pain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://mac.getutm.app/" rel="noopener noreferrer"&gt;&lt;strong&gt;UTM&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;A macOS GUI front end for QEMU&lt;/td&gt;
&lt;td&gt;Nice for making 1–2 by GUI, but not a great fit for headless / reproducibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/abiosoft/colima" rel="noopener noreferrer"&gt;&lt;strong&gt;Colima&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;"Containers on Lima," a Docker/k8s abstraction&lt;/td&gt;
&lt;td&gt;Uses Lima underneath; its aim is providing a container runtime, not defining VMs directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://lima-vm.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Lima&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Headless Linux VMs from declarative YAML&lt;/td&gt;
&lt;td&gt;One YAML &lt;strong&gt;reproduces the same VM any number of times&lt;/strong&gt;, headless, containerd-friendly, native speed on the vz backend ← &lt;strong&gt;chosen&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As the diagram shows, with k3d the nodes are &lt;strong&gt;containers&lt;/strong&gt; sharing one kernel. It's fast and light, but because the kernels aren't isolated between nodes, it's a step removed from the feel of "operating real nodes." The rest (Multipass, VirtualBox, UTM, Lima) are &lt;strong&gt;VMs, each with its own kernel&lt;/strong&gt;, so isolation is strong. What separates them further is &lt;strong&gt;management style and backend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lima (Linux Machines) is an open-source tool for standing up headless Linux VMs on macOS. I chose it for three reasons.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declarative and reproducible.&lt;/strong&gt; Write the VM's spec (CPU, memory, disk, distro) in YAML, and the same definition mass-produces four identical VMs as-is. That's a different level of reproducibility from clicking through a GUI four times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suits headless.&lt;/strong&gt; It runs as an always-on server, so no GUI is needed. Lima runs entirely from the CLI (&lt;code&gt;limactl&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast backend.&lt;/strong&gt; Since v1.0, on macOS (13.5+) Lima uses vz (Apple Virtualization.framework) as the default backend. Running an Intel VM on an Intel Mac this time means native virtualization, not emulation, so it's light. (On older macOS where vz isn't available, you can fall back to &lt;code&gt;vmType: qemu&lt;/code&gt;.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third point is grounded in the &lt;a href="https://lima-vm.io/docs/config/vmtype/" rel="noopener noreferrer"&gt;Lima vmType docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Lima works
&lt;/h3&gt;

&lt;p&gt;It looks like "a tool for standing up Linux VMs on a Mac," but once you see the structure, it becomes clear why it gets reproducibility and speed at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga0h6i7b1vp56ikcv0gv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga0h6i7b1vp56ikcv0gv.png" alt="Lima architecture — limactl brings up guest VMs via vz" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;limactl&lt;/code&gt; (host CLI)&lt;/strong&gt; — Lima's core. One line, &lt;code&gt;limactl start ./k3s-agent.yaml&lt;/code&gt;, builds and boots a VM exactly to the spec in the YAML. With no GUI, drop it in a script and loop four times to stand up four identical VMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;lima.yaml&lt;/code&gt; (declarative config)&lt;/strong&gt; — a single file holding the VM's CPU, memory, disk, distro, and provisioning scripts. It's the "VM blueprint," so sharing the same file gets anyone the same VM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vz (Apple Virtualization.framework)&lt;/strong&gt; — the hypervisor that actually runs the VM. It's Lima's default on macOS 13.5+, and since it runs an Intel guest on an Intel host, it runs natively with no emulation. Run &lt;code&gt;systemd-detect-virt&lt;/code&gt; inside a joined VM and you'll see &lt;code&gt;apple&lt;/code&gt; — the proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guest VM (Ubuntu)&lt;/strong&gt; — has its own Linux kernel, separate from the host. On top of it run containerd (bundled in k3s) and the k3s agent, and a separately installed Tailscale gives the &lt;code&gt;tailscale0&lt;/code&gt; interface a &lt;code&gt;100.x&lt;/code&gt; address. The VM joins the cloud nodes over this &lt;code&gt;100.x&lt;/code&gt; (§5·§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host ↔ guest links&lt;/strong&gt; — Lima sets up virtiofs file sharing, port-forwarding, and SSH automatically. So you drop straight into the VM from the Mac terminal and exchange files.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;In short: "one blueprint (YAML) → &lt;code&gt;limactl&lt;/code&gt; boots it via vz → a real node with its own kernel." For the conditions "the same machine, many times, by script, lightly," Lima fit best.&lt;/p&gt;
&lt;/blockquote&gt;
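
&lt;p&gt;The "own kernel" point is easy to see by asking both sides for their kernel name. A minimal sketch, assuming an instance called &lt;code&gt;k3s-agent&lt;/code&gt; (the name used in §4) is already running:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the Mac host: prints the Darwin kernel name and release.
uname -sr

# Inside the guest: the same command reports a Linux kernel instead.
limactl shell k3s-agent -- uname -sr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;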

&lt;h2&gt;
  
  
  3. Splitting into meaningful units
&lt;/h2&gt;

&lt;p&gt;Once splitting was decided, what's left is the count. There's a reason I went with four rather than two or eight.&lt;/p&gt;

&lt;p&gt;The yardstick was &lt;strong&gt;RAM.&lt;/strong&gt; What caps the node count is memory, not CPU. Several VMs can time-share CPU (oversubscription), but memory can't be shared that way: once you allocate it, it's gone.&lt;/p&gt;

&lt;p&gt;Leave about half of the 64GB for the macOS host and headroom, and the VMs' share is around 32GB. How many pieces to split that into?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffskmz4dxuipsdd2r4itr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffskmz4dxuipsdd2r4itr.png" alt="Why 4 nodes — splitting 64GB and the node-count balance" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The minimum size of one node.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Slice too finely and the fixed overhead each node reserves for kubelet and the OS starts to stand out. In &lt;a href="https://learnkube.com/kubernetes-node-size" rel="noopener noreferrer"&gt;learnkube's numbers&lt;/a&gt;, a 1 vCPU/4GB node hands over about 1.1GB (28%!) to the system. Around 8GiB that ratio becomes bearable, leaving room to run a meaningful workload on top. So I made &lt;strong&gt;one VM = 8GiB&lt;/strong&gt; the unit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balancing the count.&lt;/strong&gt; 32GB ÷ 8GiB = 4 nodes. The numbers lined up, and it fits the learning goal too.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two&lt;/strong&gt; is too few to call multi-node. Lose one and half is gone, and however many replicas you spread, the effective spread is two.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eight&lt;/strong&gt; shrinks each node from 8GiB to 4GiB, pushing the per-node overhead ratio back up, while eight VMs fight over the host's cores, so CPU contention and host-RAM pressure get rough. The more nodes there are, the more the &lt;a href="https://kubernetes.io/docs/setup/best-practices/cluster-large/" rel="noopener noreferrer"&gt;node controller's health-check load&lt;/a&gt; grows too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four&lt;/strong&gt; is the lowest count where you can &lt;code&gt;drain&lt;/code&gt; one node and have three carry it, spread Pods with anti-affinity, and practice rolling a node out and back — while still keeping the 8GiB workload unit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why being short on CPU is fine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I gave each one 3 vCPU, so four make 12 vCPU. That's more than the host's physical core count, so the vCPUs are oversubscribed. It still works because most of this homelab's workloads are &lt;strong&gt;warm-start&lt;/strong&gt;: they sleep on minimal resources when unused and wake on a request. Not every Pod runs at full load at once; they wake and sleep on a stagger, so several VMs sharing the physical cores is no strain in real use. Unlike memory, CPU is time-shared, which is exactly why oversubscription holds up under virtualization.&lt;/p&gt;
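
&lt;p&gt;The sizing in this section reduces to a few lines of integer arithmetic. A sketch using the numbers from the text; the variable names are only illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;host_ram_gib=64   # the iMac's memory
vm_unit_gib=8     # memory per VM
vcpu_per_vm=3     # vCPU per VM

# About half of the host memory stays with macOS and headroom.
vm_budget_gib=$((host_ram_gib / 2))
node_count=$((vm_budget_gib / vm_unit_gib))

echo "VM budget:  ${vm_budget_gib} GiB"               # 32 GiB
echo "Node count: ${node_count}"                      # 4
echo "Total vCPU: $((node_count * vcpu_per_vm))"      # 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Changing &lt;code&gt;vm_unit_gib&lt;/code&gt; to 4 or 16 reproduces the eight-node and two-node alternatives weighed above.&lt;/p&gt;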

&lt;p&gt;To sum up, the initial design is &lt;strong&gt;4 home nodes, each 3 vCPU / 8GiB / 300GiB disk / Ubuntu 24.04 LTS.&lt;/strong&gt; Add #1's two cloud control-plane nodes and you reach the target of six.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Home node (initial, all 4 the same)&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vCPU&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;8 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;300 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04 LTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Role&lt;/td&gt;
&lt;td&gt;k3s agent (workload only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;One thing up front. The above is the &lt;strong&gt;initial design&lt;/strong&gt;, and three of the four are still like that. But while operating it (trying out a larger app, checking compatibility against a different OS version) I bumped just the &lt;strong&gt;fourth node (agent-4)&lt;/strong&gt; to 4 vCPU / 16GiB and reinstalled it on Ubuntu 25.10. So when you run &lt;code&gt;limactl list&lt;/code&gt; in §4, one machine looks different.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. Installing Lima and defining the VM
&lt;/h2&gt;

&lt;p&gt;From here it's hands-on. The order is ① install Lima → ② write the VM blueprint (YAML) → ③ start four → ④ verify. &lt;strong&gt;Every output below was taken on this actual iMac.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4-1. Install Lima (host = the iMac's macOS)
&lt;/h3&gt;

&lt;p&gt;One line with Homebrew.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;brew install lima
limactl --version


limactl version 2.0.3

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vz is the default backend only on macOS 13.5 or newer (this iMac meets that). On an older version, set &lt;code&gt;vmType&lt;/code&gt; to &lt;code&gt;qemu&lt;/code&gt; in the YAML below; it's slower, but it works the same.&lt;/p&gt;
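
&lt;p&gt;If you aren't sure which side of that line a host falls on, macOS can report its own version. A one-line check (run on the Mac itself, not inside a VM):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints the host's macOS version; compare it against 13.5.
sw_vers -productVersion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;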

&lt;h3&gt;
  
  
  4-2. The VM blueprint — &lt;code&gt;k3s-agent.yaml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;To keep the four identical, I freeze the definition into one file. This YAML is the blueprint for all four nodes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# k3s-agent.yaml — shared blueprint for the 4 home nodes (Lima)&lt;/span&gt;
&lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-amd64.img"&lt;/span&gt;
    &lt;span class="na"&gt;arch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x86_64"&lt;/span&gt;
&lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8GiB"&lt;/span&gt;
&lt;span class="na"&gt;disk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300GiB"&lt;/span&gt;
&lt;span class="c1"&gt;# vmType omitted → vz is automatic on macOS 13.5+ (use vmType: "qemu" on older macOS)&lt;/span&gt;
&lt;span class="c1"&gt;# mounts/containerd left at defaults. As a node, no host-directory sharing needed.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The points that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;images&lt;/code&gt; — the Ubuntu 24.04 LTS cloud image (&lt;code&gt;x86_64&lt;/code&gt;, since it's Intel). The same image as Lima's default &lt;code&gt;ubuntu-24.04&lt;/code&gt; template. (On Apple Silicon, switch to an &lt;code&gt;arch: "aarch64"&lt;/code&gt; image.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpus&lt;/code&gt;/&lt;code&gt;memory&lt;/code&gt;/&lt;code&gt;disk&lt;/code&gt; — the 3 vCPU / 8GiB / 300GiB decided in §3.&lt;/li&gt;
&lt;li&gt;Omitting &lt;code&gt;vmType&lt;/code&gt; is deliberate — on macOS 13.5+, vz is chosen automatically. That's why &lt;code&gt;systemd-detect-virt&lt;/code&gt; reads &lt;code&gt;apple&lt;/code&gt; inside the VM.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note: the disk is sparse (thinly allocated), but whatever you actually write takes real space.&lt;/strong&gt; &lt;code&gt;disk: 300GiB&lt;/code&gt; is a ceiling, so each VM only takes the image's size at first, but as four of them fill up they eat a good chunk of the host disk in total. On an HDD especially, leave generous room.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4-3. Start four
&lt;/h3&gt;

&lt;p&gt;Same blueprint, &lt;strong&gt;only the name changes&lt;/strong&gt;, four times.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;n &lt;span class="k"&gt;in &lt;/span&gt;k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;limactl start &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; ./k3s-agent.yaml &lt;span class="nt"&gt;--tty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
&lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lima prefixes the instance name with &lt;code&gt;lima-&lt;/code&gt; to make the hostname: &lt;code&gt;lima-k3s-agent&lt;/code&gt;, &lt;code&gt;lima-k3s-agent-2&lt;/code&gt;, and so on. That name later shows up as-is as the node name in &lt;code&gt;kubectl get nodes&lt;/code&gt;. (&lt;code&gt;--tty=false&lt;/code&gt; is for automation: it creates the VM without stopping at the interactive prompt that offers to open an editor.)&lt;/p&gt;

&lt;h3&gt;
  
  
  4-4. Verify (real output)
&lt;/h3&gt;

&lt;p&gt;Check on the host that all four came up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;limactl list


NAME STATUS SSH CPUS MEMORY DISK DIR
k3s-agent Running 127.0.0.1:61372 3 8GiB 300GiB ~/.lima/k3s-agent
k3s-agent-2 Running 127.0.0.1:61392 3 8GiB 300GiB ~/.lima/k3s-agent-2
k3s-agent-3 Running 127.0.0.1:60490 3 8GiB 300GiB ~/.lima/k3s-agent-3
k3s-agent-4 Running 127.0.0.1:61460 4 16GiB 300GiB ~/.lima/k3s-agent-4

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All four are &lt;code&gt;Running&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Drop into a VM and check its spec, virtualization backend, and OS.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;limactl shell k3s-agent -- nproc
limactl shell k3s-agent -- systemd-detect-virt
limactl shell k3s-agent -- free -h
limactl shell k3s-agent -- grep PRETTY_NAME /etc/os-release


3
apple
               total used free shared buff/cache available
Mem: 7.8Gi 2.3Gi 744Mi 232Mi 5.3Gi 5.5Gi
Swap: 0B 0B 0B
PRETTY_NAME="Ubuntu 24.04.3 LTS"

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3 vCPU, ~7.8GiB of memory, &lt;code&gt;systemd-detect-virt&lt;/code&gt; reading &lt;code&gt;apple&lt;/code&gt; (proof it's on vz), Ubuntu 24.04.3 LTS. (&lt;code&gt;used&lt;/code&gt;/&lt;code&gt;buff/cache&lt;/code&gt; are already loaded because this node is running k3s workloads right now — right after creation it'd be nearly empty.)&lt;/p&gt;
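
&lt;p&gt;The same check can be repeated across all four VMs in one pass. A small sketch that reuses the instance names from §4-3 and only the commands already shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print each instance name followed by the hypervisor its guest detects.
for n in k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4; do
  printf '%s: ' "$n"
  limactl shell "$n" -- systemd-detect-virt
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;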

&lt;p&gt;Only &lt;code&gt;k3s-agent-4&lt;/code&gt;, swapped during operation, has a different OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;limactl shell k3s-agent-4 -- grep PRETTY_NAME /etc/os-release
limactl shell k3s-agent-4 -- nproc


PRETTY_NAME="Ubuntu 25.10"
4

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the cluster's point of view, a different OS version is no problem — as long as the k3s version and container runtime line up (confirmed in §7).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At this point four Ubuntu VMs are up on the iMac (three at the initial spec, one beefed up for testing). They're still just four Linux machines with nothing to do with the cluster. Next is &lt;strong&gt;Tailscale&lt;/strong&gt; (§5), which binds these four and the Tokyo cloud nodes into one private network.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Tailscale — binding the four into one private network with the Tokyo cluster
&lt;/h2&gt;

&lt;p&gt;What §4 produced is four empty Ubuntu VMs on the iMac. To bind them with the Tokyo cloud nodes, there's a problem to solve first.&lt;/p&gt;

&lt;p&gt;The home VMs have no public IP. They sit behind the router's NAT, so nothing outside (the cloud) can initiate a connection to a home VM. You could punch a hole with port-forwarding + DDNS, but I'd rather not touch the router or expose the home IP to the internet. The answer was already decided back in #1: &lt;strong&gt;Tailscale.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tailscale is a WireGuard-based mesh VPN. Every machine &lt;strong&gt;dials outbound&lt;/strong&gt;, so even with both sides behind NAT they can usually connect directly (falling back to a DERP relay if the direct path fails), and each machine gets a fixed private address in the &lt;code&gt;100.64.0.0/10&lt;/code&gt; range. Since #1 already bound the two cloud nodes this way, this time it's just a matter of adding the four home VMs to the same tailnet.&lt;/p&gt;

&lt;p&gt;Why install Tailscale per VM? Because each k3s agent advertises itself on a Tailscale &lt;code&gt;100.x&lt;/code&gt; address via &lt;code&gt;--node-ip&lt;/code&gt; (confirmed with the real flags in §6). Each node needs its own fixed &lt;code&gt;100.x&lt;/code&gt;, so Tailscale goes on each of the four VMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft83eh90ei4rveiuxbptq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft83eh90ei4rveiuxbptq.png" alt="One tailnet — Tokyo cloud and Sapporo home on one private network" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5-1. Install Tailscale on each VM
&lt;/h3&gt;

&lt;p&gt;Same one line on all four.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://tailscale.com/install.sh | sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Either run it by dropping into each VM with &lt;code&gt;limactl shell&lt;/code&gt;, or put it in the provisioning script of §4-2's &lt;code&gt;lima.yaml&lt;/code&gt; so it installs automatically when the VM is created.&lt;/p&gt;
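
&lt;p&gt;For the first option, a loop over the four instances saves some typing. The sketch below only prints the commands so they can be reviewed first; the helper names &lt;code&gt;vm_names&lt;/code&gt; and &lt;code&gt;install_cmds&lt;/code&gt; are hypothetical, and the instance names are the ones from §4.&lt;/p&gt;

```shell
# vm_names: print the four instance names used in this article.
vm_names() {
  printf '%s\n' k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4
}

# install_cmds: print (do not run) one install command per VM,
# so the plan can be reviewed before anything executes.
install_cmds() {
  vm_names | while read -r vm; do
    printf "limactl shell %s -- sh -c 'curl -fsSL https://tailscale.com/install.sh | sh'\n" "$vm"
  done
}

# Usage on the host:
#   install_cmds        # review the four commands
#   install_cmds | sh   # then run them
```

&lt;p&gt;Printing first and piping to &lt;code&gt;sh&lt;/code&gt; second is a deliberate choice here: a typo in an instance name shows up before anything is installed.&lt;/p&gt;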

&lt;h3&gt;
  
  
  5-2. Auth — an auth key, since it's headless
&lt;/h3&gt;

&lt;p&gt;The VMs have no browser, so instead of interactive login I authenticate non-interactively with an &lt;strong&gt;auth key.&lt;/strong&gt; Issue one &lt;strong&gt;Reusable&lt;/strong&gt; key in the Tailscale admin console under &lt;strong&gt;Settings → Keys&lt;/strong&gt; (reusable so the same key works for all four) and a &lt;code&gt;tskey-auth-…&lt;/code&gt; string appears. It's shown only once, so copy it right then.&lt;/p&gt;

&lt;p&gt;On each VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tailscale up &lt;span class="nt"&gt;--auth-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tskey-auth-XXXXXXXX...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I joined them on a &lt;strong&gt;tagless personal account&lt;/strong&gt; — keeping it simple. To lock access down further, you can layer on tags (like &lt;code&gt;tag:server&lt;/code&gt;) and an ACL policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  5-3. Verify — six nodes on one tailnet
&lt;/h3&gt;

&lt;p&gt;Check that each VM got a &lt;code&gt;100.x&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;limactl shell k3s-agent -- tailscale ip -4


100.84.x.x

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
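
&lt;p&gt;If you want a script to confirm that the address really is a tailnet address, a small range check is enough. This is a sketch under one assumption: that "a &lt;code&gt;100.x&lt;/code&gt;" means an IPv4 address inside &lt;code&gt;100.64.0.0/10&lt;/code&gt;, which works out to a first octet of 100 and a second octet from 64 to 127. The function name &lt;code&gt;in_cgnat&lt;/code&gt; is illustrative.&lt;/p&gt;

```shell
# in_cgnat: succeed if the IPv4 address in $1 lies in 100.64.0.0/10,
# i.e. first octet 100 and second octet between 64 and 127.
in_cgnat() {
  first=$(printf '%s' "$1" | cut -d. -f1)
  second=$(printf '%s' "$1" | cut -d. -f2)
  # Reject an empty or non-numeric second octet before comparing.
  case "$second" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$first" = "100" ] || return 1
  [ "$second" -ge 64 ] || return 1
  [ "$second" -le 127 ]
}

# Usage on the host:
#   in_cgnat "$(limactl shell k3s-agent -- tailscale ip -4)" || echo "not a tailnet address"
```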



&lt;p&gt;Look at the whole tailnet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;limactl shell k3s-agent -- tailscale status


100.84.x.x lima-k3s-agent me@… linux -
&lt;/span&gt;&lt;span class="gp"&gt;100.98.x.x lima-k3s-agent-2 me@… linux active;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;direct
&lt;span class="gp"&gt;100.117.x.x lima-k3s-agent-3 me@… linux active;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;direct
&lt;span class="gp"&gt;100.90.x.x lima-k3s-agent-4 me@… linux active;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;direct
&lt;span class="gp"&gt;100.71.x.x ip-172-26-3-146 me@… linux active;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;direct &lt;span class="c"&gt;# cloud server-A (#1)&lt;/span&gt;
&lt;span class="gp"&gt;100.99.x.x ip-172-26-2-70 me@… linux active;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;direct &lt;span class="c"&gt;# cloud server-B (#1)&lt;/span&gt;
&lt;span class="gp"&gt;100.88.x.x seon-mbp-m4 me@… macOS - #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;work laptop
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other same-account devices (the iMac host, NAS, and so on) show up too, but I trimmed them above. &lt;code&gt;active; direct&lt;/code&gt; is the mark of a &lt;strong&gt;direct&lt;/strong&gt; connection with no DERP relay (NAT traversal succeeded). With that, the four home VMs and the two Tokyo cloud boxes reach each other by &lt;code&gt;100.x&lt;/code&gt; inside one tailnet.&lt;/p&gt;
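
&lt;p&gt;The direct-versus-relay distinction can also be checked mechanically. The sketch below assumes that a relayed peer's line in &lt;code&gt;tailscale status&lt;/code&gt; contains the word &lt;code&gt;relay&lt;/code&gt; where a direct one says &lt;code&gt;direct&lt;/code&gt;, and that the hostname is the second column, as in the output above. &lt;code&gt;relayed_peers&lt;/code&gt; is a made-up name.&lt;/p&gt;

```shell
# relayed_peers: read `tailscale status` output on stdin and print the
# hostname (second column) of every peer whose line mentions a relay.
relayed_peers() {
  awk '/relay/ { print $2 }'
}

# Usage on the host:
#   limactl shell k3s-agent -- tailscale status | relayed_peers
# No output means no peer is currently going through DERP.
```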

&lt;h3&gt;
  
  
  5-4. Firewall — Tailscale opens zero new public ports
&lt;/h3&gt;

&lt;p&gt;Let's stop here a moment and check: "did I just expose anything extra to the internet?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hp33ag1nmjmdlc0xv9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hp33ag1nmjmdlc0xv9z.png" alt="Firewall — adding Tailscale opens zero new public ports" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tailscale only uses outbound connections (UDP 41641, or DERP over 443 if that's blocked). So even with Tailscale on the VMs, the number of &lt;strong&gt;new inbound firewall ports to open is zero.&lt;/strong&gt; It extends only the private network, without widening the exposed surface. The orthodox way is to keep cluster ports like the apiserver (6443) off the public net and reachable &lt;strong&gt;only inside the tailnet&lt;/strong&gt; (the laptop's connection goes over the tailnet too, in §7).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note — "exactly zero public ports" is something I mean to finish in a later article.&lt;/strong&gt; The apiserver (6443) can be closed on the public side the moment you switch to tailnet access. But closing &lt;strong&gt;inter-node Pod traffic (flannel VXLAN, UDP 8472)&lt;/strong&gt; on the public net too needs extra config to run flannel over Tailscale (otherwise Pod-to-Pod traffic between nodes breaks — see the note in §6). That work and its limits (the overhead of double encapsulation, and so on) come later. &lt;strong&gt;What's certain in this article is "Tailscale itself adds no exposure at all."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Something I learned from running it: this is where a lot of time goes. Even within this article's scope, you end up opening a port to get Pods talking to each other. And even with the port open, the limited speed and latency of the Tailscale overlay make it a real struggle. Solving it is the next article's topic.&lt;/p&gt;

&lt;p&gt;At this point six machines (2 cloud + 4 home) call each other by a fixed &lt;code&gt;100.x&lt;/code&gt; on one private network. They aren't bound into a single cluster yet — in the next §6, that &lt;code&gt;100.x&lt;/code&gt; goes straight into the k3s agent join.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6. Agent join — putting the Tailscale 100.x straight into node-ip
&lt;/h2&gt;

&lt;p&gt;This is the article's goal. The four VMs now on one tailnet &lt;strong&gt;join as agents&lt;/strong&gt; to the cloud cluster stood up in #1.&lt;/p&gt;

&lt;p&gt;The method is almost the same as #1's server install. You pass the k3s install script two environment variables (&lt;code&gt;K3S_URL&lt;/code&gt;, &lt;code&gt;K3S_TOKEN&lt;/code&gt;) and the agent flags. The key is &lt;strong&gt;setting the address the node advertises itself on (&lt;code&gt;--node-ip&lt;/code&gt;) to the Tailscale &lt;code&gt;100.x&lt;/code&gt; you got in §5.&lt;/strong&gt; That gets the node to the apiserver over the tailnet and into the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmgrek5q1v5t34orsro8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmgrek5q1v5t34orsro8.png" alt="Agent join — putting the Tailscale 100.x into node-ip" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6-1. Get the join token from the server
&lt;/h3&gt;

&lt;p&gt;The token an agent uses to join lives on the server (#1's cluster-init node).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /var/lib/rancher/k3s/server/node-token


K10&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;::server:&amp;lt;random&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy this value (the &lt;code&gt;K3S_TOKEN&lt;/code&gt; below). The token's location and meaning are laid out in the &lt;a href="https://docs.k3s.io/cli/token" rel="noopener noreferrer"&gt;k3s token docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  6-2. Join each VM as an agent
&lt;/h3&gt;

&lt;p&gt;Run this on each of the four VMs. &lt;code&gt;K3S_URL&lt;/code&gt; is the &lt;strong&gt;server's Tailscale address&lt;/strong&gt;; &lt;code&gt;--node-ip&lt;/code&gt;/&lt;code&gt;--node-external-ip&lt;/code&gt; are &lt;strong&gt;that VM's Tailscale &lt;code&gt;100.x&lt;/code&gt;.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.k3s.io | &lt;span class="nv"&gt;INSTALL_K3S_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v1.34.3+k3s1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;K3S_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://100.71.x.x:6443 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;K10&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;::server:&amp;lt;random&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-s&lt;/span&gt; - agent &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-ip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;this-VM-100.x&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-external-ip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;this-VM-100.x&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking down the flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;K3S_URL=https://100.71.x.x:6443&lt;/code&gt;: the apiserver on the server (#1 cluster-init). It's a Tailscale address, so it's reachable inside the tailnet even with 6443 closed on the public net.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;K3S_TOKEN&lt;/code&gt;: the value from 6-1. &lt;strong&gt;With both URL and TOKEN present&lt;/strong&gt;, the k3s install script installs as an &lt;strong&gt;agent&lt;/strong&gt;, not a server (&lt;code&gt;K3S_URL&lt;/code&gt; is what switches it to agent mode; the token authenticates the join).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--node-ip=100.x&lt;/code&gt; / &lt;code&gt;--node-external-ip=100.x&lt;/code&gt;: use the Tailscale address as this node's InternalIP / externally advertised address.&lt;/li&gt;
&lt;/ul&gt;
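
&lt;p&gt;Since both address flags take the same value, they can be derived from &lt;code&gt;tailscale ip -4&lt;/code&gt; instead of being typed per VM. A minimal sketch, with &lt;code&gt;agent_flags&lt;/code&gt; as a hypothetical helper name; the token stays elided exactly as above.&lt;/p&gt;

```shell
# agent_flags: print the two address flags for the Tailscale IPv4 in $1.
agent_flags() {
  printf '%s\n' "--node-ip=$1 --node-external-ip=$1"
}

# Usage inside a VM (tailscale ip -4 prints this machine's 100.x):
#   ts_ip=$(tailscale ip -4)
#   curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.34.3+k3s1 \
#     K3S_URL=https://100.71.x.x:6443 K3S_TOKEN=... \
#     sh -s - agent $(agent_flags "$ts_ip")
```

&lt;p&gt;The unquoted &lt;code&gt;$(agent_flags ...)&lt;/code&gt; in the usage line is intentional: the shell has to split the output into two separate arguments.&lt;/p&gt;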

&lt;p&gt;Peek at a node that actually joined and you'll see the same thing written there (real, token masked):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/k3s-agent.service
&lt;/span&gt;&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/local/bin/k3s agent --node-external-ip=100.x.x.x --node-ip=100.x.x.x&lt;/span&gt;

&lt;span class="c"&gt;# /etc/systemd/system/k3s-agent.service.env
&lt;/span&gt;&lt;span class="py"&gt;K3S_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'https://100.71.x.x:6443'&lt;/span&gt;
&lt;span class="py"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;********&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flag meanings are in the &lt;a href="https://docs.k3s.io/cli/agent" rel="noopener noreferrer"&gt;k3s agent reference&lt;/a&gt;. The CNI carries over the flannel &lt;strong&gt;vxlan&lt;/strong&gt; decided on the server in #1 as-is, and all six nodes are pinned to k3s version &lt;code&gt;v1.34.3+k3s1&lt;/code&gt; (confirmed in §7).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note — this article goes only as far as "node join (Ready)." Inter-node &lt;em&gt;Pod&lt;/em&gt; traffic is a later networking story.&lt;/strong&gt; With just &lt;code&gt;--node-ip=100.x&lt;/code&gt;, a node reaches the apiserver over the tailnet and goes &lt;code&gt;Ready&lt;/code&gt; (which is this article's goal: "the iMac joins as nodes"). But getting &lt;strong&gt;Pods on different nodes&lt;/strong&gt; to talk requires flannel VXLAN to cross between nodes, and there's a trap here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default flannel advertises each node's &lt;strong&gt;default-route interface IP&lt;/strong&gt; as the VXLAN destination (public-ip). For cloud nodes that is a VPC private address (e.g., &lt;code&gt;172.26.x&lt;/code&gt;); for home VMs it is an address on their own private network. So &lt;strong&gt;they sit on different underlays and may not reach each other directly.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;So you need config to "run VXLAN over the tailnet (&lt;code&gt;100.x&lt;/code&gt;)" (like &lt;code&gt;--flannel-iface=tailscale0&lt;/code&gt;), but &lt;strong&gt;you must not change just one side (the agents).&lt;/strong&gt; Setting it on agents only makes them disagree with the server's destination (VPC), and it &lt;strong&gt;breaks asymmetrically — traffic passes one way only.&lt;/strong&gt; You have to align both server and agents (= touching #1's server config too).&lt;/li&gt;
&lt;li&gt;This — "putting flannel properly on the tailnet + the overhead and limits of double encapsulation (VXLAN over WireGuard) + the optimization" — is the next article's topic. So this time I stop at nodes joining and going &lt;code&gt;Ready&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6-3. (optional) Automate it in lima.yaml
&lt;/h3&gt;

&lt;p&gt;Instead of repeating by hand on four VMs, put the Tailscale install (§5) and the agent join above into the provisioning script of §4's &lt;code&gt;lima.yaml&lt;/code&gt;, and one &lt;code&gt;limactl start&lt;/code&gt; carries it all the way to a finished node. The §2 promise of "one blueprint, four identical nodes" comes full circle here.&lt;/p&gt;
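
&lt;p&gt;A provisioning script may run more than once over a VM's lifetime, so it helps to make each step skip itself when its work is already done. The guard below is a sketch of that idea; &lt;code&gt;needs_install&lt;/code&gt; is a made-up helper, and the commented body only outlines where the §5 and §6-2 commands would go.&lt;/p&gt;

```shell
# needs_install: succeed when the command named in $1 is NOT on PATH,
# so a provisioning script can skip steps that already ran.
needs_install() {
  [ -z "$(command -v "$1" || true)" ]
}

# Outline of a provisioning body built on that guard:
#   if needs_install tailscale; then
#     curl -fsSL https://tailscale.com/install.sh | sh
#   fi
#   if needs_install k3s; then
#     ...the agent join command from 6-2 goes here...
#   fi
```

&lt;p&gt;Note that the auth key and join token would also have to reach the script somehow; how to pass those secrets safely is left open here.&lt;/p&gt;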

&lt;h2&gt;
  
  
  7. Verifying from the laptop — six nodes in one cluster
&lt;/h2&gt;

&lt;p&gt;Following the orthodox approach, the apiserver (6443) isn't open on the public net and is reached &lt;strong&gt;only inside the tailnet.&lt;/strong&gt; So to check, the laptop joins the same tailnet and the kubeconfig is pointed at the server's tailnet address. (The principle is the same whether the laptop is Mac, Windows, or Linux.)&lt;/p&gt;

&lt;h3&gt;
  
  
  7-1. Put the laptop on the tailnet (Mac / Windows / Linux)
&lt;/h3&gt;

&lt;p&gt;Install the same Tailscale, on the same account, on the laptop too.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt; — the GUI app via &lt;code&gt;brew install --cask tailscale-app&lt;/code&gt; (or the App Store), then log in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt; — the installer from &lt;a href="https://tailscale.com/download" rel="noopener noreferrer"&gt;tailscale.com/download&lt;/a&gt;, then log in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt; — &lt;code&gt;curl -fsSL https://tailscale.com/install.sh | sh&lt;/code&gt; → &lt;code&gt;sudo tailscale up&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you log in, the laptop gets a &lt;code&gt;100.x&lt;/code&gt; and joins the same tailnet as the nodes. If &lt;code&gt;tailscale status&lt;/code&gt; shows the cloud and home nodes, you're ready.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note: the Homebrew package names are confusing.&lt;/strong&gt; The macOS GUI app (menu bar) is the cask &lt;code&gt;tailscale-app&lt;/code&gt;. &lt;code&gt;brew install tailscale&lt;/code&gt; (the formula) installs only the command-line tools (the &lt;code&gt;tailscale&lt;/code&gt; CLI and the &lt;code&gt;tailscaled&lt;/code&gt; daemon), with no GUI. To log in via the GUI on the laptop, use &lt;code&gt;brew install --cask tailscale-app&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  7-2. Point kubeconfig at the tailnet address
&lt;/h3&gt;

&lt;p&gt;The server's (#1 cluster-init) kubeconfig is at &lt;code&gt;/etc/rancher/k3s/k3s.yaml&lt;/code&gt;, with the &lt;code&gt;server&lt;/code&gt; field defaulting to &lt;code&gt;https://127.0.0.1:6443&lt;/code&gt;. Bring this file to the laptop (e.g., &lt;code&gt;~/.kube/config&lt;/code&gt;) and change &lt;code&gt;server&lt;/code&gt; to the &lt;strong&gt;server's Tailscale address.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl config set-cluster default &lt;span class="nt"&gt;--server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://100.71.x.x:6443

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
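
&lt;p&gt;If you would rather rewrite the file while copying it, a one-line &lt;code&gt;sed&lt;/code&gt; filter does the same job. This is a sketch that assumes the file still carries the default &lt;code&gt;https://127.0.0.1:6443&lt;/code&gt; value; &lt;code&gt;retarget_kubeconfig&lt;/code&gt; and the &lt;code&gt;server-a&lt;/code&gt; ssh alias are placeholders.&lt;/p&gt;

```shell
# retarget_kubeconfig: read a k3s kubeconfig on stdin and write it to
# stdout with the default loopback server replaced by the address in $1.
retarget_kubeconfig() {
  sed "s#https://127.0.0.1:6443#https://$1:6443#"
}

# Usage from the laptop (server-a is a placeholder ssh alias; send the
# output to wherever you keep your kubeconfig):
#   ssh server-a sudo cat /etc/rancher/k3s/k3s.yaml | retarget_kubeconfig 100.71.x.x
```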



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note — TLS won't pass unless the cert SAN matches.&lt;/strong&gt; Even after pointing &lt;code&gt;server&lt;/code&gt; at the tailnet, if that &lt;code&gt;100.x&lt;/code&gt; isn't in the apiserver cert's SAN you'll be rejected with &lt;code&gt;x509: certificate is valid for ... not ...&lt;/code&gt;. When k3s brings up the server with &lt;code&gt;--node-ip=100.x&lt;/code&gt;, it includes that address (the InternalIP) in the cert SAN automatically — peek at the cert and the &lt;code&gt;100.x&lt;/code&gt; is there:&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X509v3 Subject Alternative Name:
DNS:kubernetes, DNS:kubernetes.default, ..., IP Address:10.43.0.1,
IP Address:100.71.x.x, IP Address:100.99.x.x, IP Address:127.0.0.1, ...

&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If the &lt;code&gt;100.x&lt;/code&gt; isn't in the SAN, add &lt;code&gt;--tls-san=100.71.x.x&lt;/code&gt; on the server and re-issue the cert (&lt;a href="https://docs.k3s.io/cli/server" rel="noopener noreferrer"&gt;k3s server docs&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
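
&lt;p&gt;To check the SAN without reading the certificate dump by eye, the text output of &lt;code&gt;openssl x509&lt;/code&gt; can be filtered. The sketch below assumes the SAN entries are comma-separated and written as &lt;code&gt;IP Address:...&lt;/code&gt;, as in the excerpt above; &lt;code&gt;san_has_ip&lt;/code&gt; is an illustrative name.&lt;/p&gt;

```shell
# san_has_ip: read certificate text on stdin and succeed if one of its
# comma-separated entries is exactly "IP Address:" followed by $1.
san_has_ip() {
  tr ',' '\n' | sed 's/^ *//; s/ *$//' | grep -F -x -q "IP Address:$1"
}

# Usage from the laptop:
#   echo | openssl s_client -connect 100.71.x.x:6443 \
#     | openssl x509 -noout -text | san_has_ip 100.71.x.x
```

&lt;p&gt;Matching whole entries rather than substrings matters: otherwise &lt;code&gt;100.71.0.1&lt;/code&gt; would appear to match a certificate that only lists &lt;code&gt;100.71.0.10&lt;/code&gt;.&lt;/p&gt;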

&lt;h3&gt;
  
  
  7-3. Verify — &lt;code&gt;kubectl get nodes&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Check it straight from the laptop. The first things to check are &lt;strong&gt;whether all six are &lt;code&gt;Ready&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;whether the agents' INTERNAL-IP is &lt;code&gt;100.x&lt;/code&gt;.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get nodes -o wide


NAME STATUS ROLES AGE VERSION INTERNAL-IP OS-IMAGE CONTAINER-RUNTIME
ip-172-26-2-70… Ready control-plane,etcd 140d v1.34.3+k3s1 100.99.x.x Amazon Linux 2023 containerd://2.1.5-k3s1
ip-172-26-3-146… Ready control-plane,etcd 140d v1.34.3+k3s1 100.71.x.x Amazon Linux 2023 containerd://2.1.5-k3s1
&lt;/span&gt;&lt;span class="gp"&gt;lima-k3s-agent Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;140d v1.34.3+k3s1 100.84.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-2 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;140d v1.34.3+k3s1 100.98.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-3 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;140d v1.34.3+k3s1 100.117.x.x Ubuntu 24.04.3 LTS containerd://2.1.5-k3s1
&lt;span class="gp"&gt;lima-k3s-agent-4 Ready &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;140d v1.34.3+k3s1 100.90.x.x Ubuntu 25.10 containerd://2.1.5-k3s1
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to read it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two cloud nodes are &lt;code&gt;control-plane,etcd&lt;/code&gt; (#1); the four home nodes are &lt;code&gt;ROLES &amp;lt;none&amp;gt;&lt;/code&gt; — agents, workload only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every node's INTERNAL-IP is &lt;code&gt;100.x&lt;/code&gt; (Tailscale).&lt;/strong&gt; That's the sign the node joined over the tailnet. If a LAN IP like &lt;code&gt;192.168.x&lt;/code&gt; shows up here, the node advertised itself wrong — so always check this column right after a join.&lt;/li&gt;
&lt;li&gt;Version &lt;code&gt;v1.34.3+k3s1&lt;/code&gt; and runtime &lt;code&gt;containerd://2.1.5-k3s1&lt;/code&gt; are the same across all six. Only &lt;code&gt;lima-k3s-agent-4&lt;/code&gt; has a different OS (&lt;code&gt;Ubuntu 25.10&lt;/code&gt;, that test node from §4), but since the k3s version and runtime match, it's no problem for joining.&lt;/li&gt;
&lt;/ul&gt;
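
&lt;p&gt;The first two checks can be scripted against the same output. A minimal sketch, assuming the column order of &lt;code&gt;kubectl get nodes -o wide&lt;/code&gt; (STATUS second, INTERNAL-IP sixth) and using the made-up name &lt;code&gt;bad_nodes&lt;/code&gt;:&lt;/p&gt;

```shell
# bad_nodes: read `kubectl get nodes -o wide --no-headers` on stdin and
# print every node that is not Ready or whose INTERNAL-IP (column 6)
# does not start with "100.".
bad_nodes() {
  awk '$2 != "Ready" || $6 !~ /^100\./ { print $1 }'
}

# Usage from the laptop:
#   kubectl get nodes -o wide --no-headers | bad_nodes
# No output means every node is Ready with a 100.x InternalIP.
```

&lt;p&gt;A cordoned node reports a status such as &lt;code&gt;Ready,SchedulingDisabled&lt;/code&gt; and would be listed too, which is probably what you want right after a join.&lt;/p&gt;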

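&lt;p&gt;The nodes carry a &lt;code&gt;node-type&lt;/code&gt; label that tells home apart from cloud; &lt;code&gt;-L node-type&lt;/code&gt; shows it as an extra column (real):&lt;br&gt;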
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get nodes -L node-type


... ip-172-26-2-70 ... lightsail
... ip-172-26-3-146 ... lightsail
... lima-k3s-agent ... lima
... lima-k3s-agent-2 ... lima
... lima-k3s-agent-3 ... lima
... lima-k3s-agent-4 ... lima

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
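
&lt;p&gt;The output above only displays the label. Setting it is a &lt;code&gt;kubectl label&lt;/code&gt; per node, which could look like the sketch below. The naming rule (anything starting with &lt;code&gt;lima-&lt;/code&gt; is a home VM, everything else is Lightsail) is an assumption drawn from the node names in this article, and &lt;code&gt;node_type_for&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```shell
# node_type_for: map a node name to the label value seen above.
# Assumption for illustration: names starting with "lima-" are home VMs
# and everything else is a Lightsail box.
node_type_for() {
  case "$1" in
    lima-*) echo lima ;;
    *)      echo lightsail ;;
  esac
}

# Usage from the laptop (add --overwrite if the label already exists):
#   for n in $(kubectl get nodes -o name); do
#     kubectl label "$n" "node-type=$(node_type_for "${n#node/}")"
#   done
```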



&lt;p&gt;When I later route workloads to home/cloud, I use this &lt;code&gt;node-type&lt;/code&gt; label in a &lt;code&gt;nodeSelector&lt;/code&gt; (placement strategy in a later installment).&lt;/p&gt;

&lt;p&gt;By the way, in &lt;strong&gt;the current operating state, with the networking finished&lt;/strong&gt;, Pods are spread across the six nodes like this (this is how it runs now, after the improvements, not right after this step):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip-172-26-2-70 68 (cloud)
ip-172-26-3-146 14 (cloud)
lima-k3s-agent 26 (home)
lima-k3s-agent-2 18 (home)
lima-k3s-agent-3 25 (home)
lima-k3s-agent-4 96 (home, the larger-app test node)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
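
&lt;p&gt;The article doesn't say how this tally was gathered. One way to produce a similar per-node count is to list the node of every Pod and count the repeats. The sketch below does that; &lt;code&gt;pods_per_node&lt;/code&gt; is a made-up name and the output format differs slightly from the listing above.&lt;/p&gt;

```shell
# pods_per_node: read one node name per line (one line per Pod) on
# stdin, drop blank lines (Pods not yet scheduled), and print
# "node count" pairs sorted by node name.
pods_per_node() {
  awk 'NF' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage from the laptop (jsonpath prints the node each Pod runs on):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | pods_per_node
```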



&lt;p&gt;Pods running on both the cloud and home nodes — that's what a hybrid cluster looks like.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With this, the home iMac became a worker in the cluster. Four empty Lima VMs → one private network with Tailscale → joined as k3s agents → &lt;strong&gt;2 in Tokyo + 4 in Sapporo = all 6 nodes &lt;code&gt;Ready&lt;/code&gt;.&lt;/strong&gt; The apiserver (6443) is seen only over the tailnet, by both the nodes and the laptop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  8. Cost — the increase was zero
&lt;/h2&gt;

&lt;p&gt;The new cost this round is effectively zero. I added four nodes, but everything I used was either free or already on hand.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Cost (USD)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lightsail server-A (8GB)&lt;/td&gt;
&lt;td&gt;$44 / mo (unchanged from #1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lightsail server-B (16GB)&lt;/td&gt;
&lt;td&gt;$84 / mo (unchanged from #1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;k3s / Lima&lt;/td&gt;
&lt;td&gt;$0 (open source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tailscale Personal&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Home nodes (Lima VM ×4, iMac)&lt;/td&gt;
&lt;td&gt;$0 (iMac I already had)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;This round's increase&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+$0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One at a time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt; — the same two Lightsail boxes from #1. Instances added this round: zero → cloud increase $0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k3s · Lima&lt;/strong&gt; — both open source. Adding nodes costs no license fee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailscale&lt;/strong&gt; — the personal plan is free. Six nodes + a laptop is nowhere near the free-tier limit → $0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Home nodes&lt;/strong&gt; — reused the idle iMac → zero new purchase.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note: I'm deliberately leaving electricity out as a number.&lt;/strong&gt; The iMac runs 24/7, so power really does cost something. But that cost swings widely by region (Sapporo, Tokyo, and each reader's own country), contract plan, and season/usage, so putting one dollar figure on it would be wrong for most readers. Instead of quantifying it, I'd suggest measuring your own draw (W) with a smart plug, which is the most accurate approach. The point is that the &lt;strong&gt;cloud and software increase is zero&lt;/strong&gt;, and the only real added cost is "electricity for a machine I already own."&lt;/p&gt;
&lt;/blockquote&gt;
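
&lt;p&gt;Once you have a measured draw, turning it into monthly energy is simple arithmetic: watts times 24 hours times 30 days, divided by 1000. The helper below is a sketch of that conversion only. &lt;code&gt;monthly_kwh&lt;/code&gt; is a made-up name, and the 60 W in the usage line is an arbitrary example, not a measurement from this setup.&lt;/p&gt;

```shell
# monthly_kwh: convert an average draw in watts ($1) into kWh for a
# 30-day month of 24/7 operation: W * 24 h * 30 d / 1000.
monthly_kwh() {
  awk -v w="$1" 'BEGIN { printf "%.1f\n", w * 24 * 30 / 1000 }'
}

# Usage: multiply the result by your own tariff per kWh.
#   monthly_kwh 60
```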

&lt;h2&gt;
  
  
  9. Glossary — what came up this time
&lt;/h2&gt;

&lt;p&gt;A quick sweep of the terms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lima / &lt;code&gt;limactl&lt;/code&gt;&lt;/strong&gt; — an open-source tool for standing up headless Linux VMs on macOS from declarative YAML. &lt;code&gt;limactl&lt;/code&gt; is its CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vz (Apple Virtualization.framework)&lt;/strong&gt; — macOS's built-in hypervisor. Lima's default on 13.5+. If &lt;code&gt;systemd-detect-virt&lt;/code&gt; reads &lt;code&gt;apple&lt;/code&gt; in the guest, it's on vz.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guest VM vs container&lt;/strong&gt; — a VM has its own kernel and strong isolation; a container shares the host kernel and has weak isolation. That's why this article chose VMs for "real nodes."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k3s server / agent&lt;/strong&gt; — a server is the control plane (+etcd); an agent is a workload-only node. Pass &lt;code&gt;K3S_URL&lt;/code&gt;+&lt;code&gt;K3S_TOKEN&lt;/code&gt; at install and it joins as an agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tailnet / &lt;code&gt;100.x&lt;/code&gt;&lt;/strong&gt; — the private mesh network Tailscale builds. Each device gets a fixed address in the &lt;code&gt;100.64.0.0/10&lt;/code&gt; (CGNAT) range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WireGuard / DERP&lt;/strong&gt; — WireGuard is Tailscale's VPN engine; when a direct path isn't possible, it detours through a DERP relay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auth key (Reusable)&lt;/strong&gt; — a key for joining the tailnet non-interactively, without a browser. Reusable across multiple nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flannel / VXLAN&lt;/strong&gt; — flannel is k3s's default CNI, VXLAN its default backend. It carries inter-node Pod packets encapsulated in UDP (8472).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--node-ip&lt;/code&gt; / InternalIP&lt;/strong&gt; — the address a node advertises to the cluster. Put a Tailscale &lt;code&gt;100.x&lt;/code&gt; here and the node joins over the tailnet (putting Pod-to-Pod traffic on the tailnet is separate config — next time).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;node-token&lt;/strong&gt; — the secret an agent uses to join. On the server at &lt;code&gt;/var/lib/rancher/k3s/server/node-token&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;node-type&lt;/code&gt; label / nodeSelector&lt;/strong&gt; — a label on the nodes (here, lima/lightsail). Used later in a &lt;code&gt;nodeSelector&lt;/code&gt; to route workloads to home/cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. Next
&lt;/h2&gt;

&lt;p&gt;With six nodes in place, the next thing is what to put on top of them.&lt;/p&gt;

&lt;p&gt;At first I got carried away and threw all kinds of things on — and the traffic wouldn't go through. I ended up ignoring the principle and opening port 8472 (udp / flannel VXLAN) to make communication work, and ran it that way. But the real trouble started once I brought in Longhorn and CNPG: latency on inter-node traffic set off a cascade of errors, with pods restarting over and over, and countless rounds of trial and error.&lt;/p&gt;

&lt;p&gt;That's what I want to get into next time.&lt;/p&gt;

&lt;p&gt;Thanks for reading all the way through.&lt;/p&gt;

</description>
      <category>infrastructure</category>
      <category>kubernetes</category>
      <category>sideprojects</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Hybrid k3s #1: Cloud and home into one cluster — initial setup</title>
      <dc:creator>SEON</dc:creator>
      <pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/seon/hybrid-k3s-1-cloud-and-home-into-one-cluster-initial-setup-72i</link>
      <guid>https://dev.to/seon/hybrid-k3s-1-cloud-and-home-into-one-cluster-initial-setup-72i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-master-en-wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-master-en-wm.png" alt="Hybrid k3s — current architecture" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  0. About this series
&lt;/h2&gt;

&lt;p&gt;This series is a record — written one piece at a time — of how I actually built the homelab shown in the diagram above, the one I'm running right now.&lt;/p&gt;

&lt;p&gt;What began as a simple "could this even work?" experiment turned, thanks to satisfying performance and endless tearing down and rebuilding, into a toy in the best sense: something that relieves the stress built up at work.&lt;/p&gt;

&lt;p&gt;It isn't a resource-rich cluster, but it has been more than enough to get a real taste of Kubernetes, and it keeps giving me new things I want to try next.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 nodes&lt;/strong&gt; — 2 Lightsail &lt;strong&gt;servers&lt;/strong&gt; (control plane + etcd) in the cloud (AWS Tokyo) + 4 &lt;strong&gt;Lima VM agents&lt;/strong&gt; on a home (Sapporo) iMac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 vCPU / 61 GiB&lt;/strong&gt; total, &lt;strong&gt;49 namespaces&lt;/strong&gt;, &lt;strong&gt;248 pods (150 running)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployment via &lt;strong&gt;ArgoCD&lt;/strong&gt;, authentication via &lt;strong&gt;Keycloak OIDC&lt;/strong&gt;, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana, and more running on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It wasn't easy, but it wasn't hard enough to give up on either — so I'm going to write up, one at a time, the things I learned while building it and the things I want to keep.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This first story is about the foundation — how I started from &lt;strong&gt;two control-plane nodes in the cloud&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Background
&lt;/h2&gt;

&lt;p&gt;There was no grand blueprint to begin with. The starting point was ordinary.&lt;/p&gt;

&lt;p&gt;Working with Kubernetes in my day job, things I want to dig into more keep coming up. Reading the docs is one thing; breaking and fixing a cluster with my own hands is another. There is an environment I can touch at work, but what I can do there is limited, and a careless mistake leads to noisy, annoying situations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I needed a cluster I could run however I wanted.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As it happened, a &lt;strong&gt;64GB-RAM iMac&lt;/strong&gt;, more than 10 years old, was sitting mostly idle at home. It still performs well enough, but its HDD makes it slow, its OS is past end-of-support, and it has given up its seat to a MacBook Pro M4 and is now resting. On the cloud side, I already had &lt;strong&gt;two small Lightsail instances&lt;/strong&gt; running personal services, and as those services grew, resources were gradually getting tight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What if I stopped keeping the idle home machine's resources and the cloud I'm already paying for separate, and used them as one?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The urge to learn and the pressure on resources converged on a single idea: &lt;strong&gt;combine the cloud and home into one cluster.&lt;/strong&gt; This article is the first step: building the cloud-side foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Why k3s — a choice under limited resources
&lt;/h2&gt;

&lt;p&gt;First, let's prepare a Kubernetes (k8s) environment.&lt;/p&gt;

&lt;p&gt;But for the resources I had in my cloud environment, standard k8s was too heavy. In my dreams I wanted to run wild on a multi-cluster with thousands of nodes; in reality I had small AWS Lightsail instances costing about $150/month and a single 10-plus-year-old iMac near retirement.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I had to pick "which Kubernetes to go with" first. Here's what my research turned up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;For this situation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed (EKS/GKE/AKS)&lt;/td&gt;
&lt;td&gt;The cloud runs the control plane for you&lt;/td&gt;
&lt;td&gt;Control-plane fee + node cost → conflicts with low cost / reusing idle gear, excluded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla Kubernetes (kubeadm)&lt;/td&gt;
&lt;td&gt;Assemble upstream yourself&lt;/td&gt;
&lt;td&gt;The most orthodox but heavy and hands-on → a burden for low-spec/small scale, excluded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;k3s&lt;/strong&gt; (Rancher/SUSE)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Single-binary lightweight distro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight distro — &lt;strong&gt;finalist&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;k0s · MicroK8s&lt;/td&gt;
&lt;td&gt;Lightweight distros of a similar kind&lt;/td&gt;
&lt;td&gt;Likewise lightweight distros — &lt;strong&gt;finalist&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minikube · kind&lt;/td&gt;
&lt;td&gt;For local dev/testing&lt;/td&gt;
&lt;td&gt;Not meant for persistent multi-node operation → excluded&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Filtering this way, the &lt;strong&gt;candidates narrowed to three lightweight distros: k3s, k0s, and MicroK8s.&lt;/strong&gt; Digging deeper into the three:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;k3s (chosen)&lt;/th&gt;
&lt;th&gt;k0s&lt;/th&gt;
&lt;th&gt;MicroK8s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Maker&lt;/td&gt;
&lt;td&gt;Rancher/SUSE&lt;/td&gt;
&lt;td&gt;Mirantis&lt;/td&gt;
&lt;td&gt;Canonical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packaging&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;snap package&lt;/strong&gt; (depends on snapd)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default datastore&lt;/td&gt;
&lt;td&gt;SQLite (kine); embedded etcd for HA&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;etcd standard&lt;/strong&gt; (kine for other DBs too)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;dqlite&lt;/strong&gt; (distributed SQLite, Raft)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HA approach&lt;/td&gt;
&lt;td&gt;Embedded etcd across multiple servers (first server started with &lt;code&gt;--cluster-init&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Provided by default&lt;/td&gt;
&lt;td&gt;Automatic HA at 3+ nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control plane&lt;/td&gt;
&lt;td&gt;server also runs workloads&lt;/td&gt;
&lt;td&gt;Internal components as separate processes, &lt;strong&gt;control-plane isolation&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Per node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default CNI&lt;/td&gt;
&lt;td&gt;flannel (lightweight, limited policy)&lt;/td&gt;
&lt;td&gt;kube-router/calico&lt;/td&gt;
&lt;td&gt;calico (HA variant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bundling&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Essential components included&lt;/strong&gt; (Traefik, ServiceLB, local-path…)&lt;/td&gt;
&lt;td&gt;Minimal, easy to swap default components&lt;/td&gt;
&lt;td&gt;Enable add-ons with &lt;code&gt;microk8s enable&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why k3s.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All three are CNCF-compliant lightweight distros, but they differ in character.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k0s&lt;/strong&gt; keeps the control plane separate from workloads, which is clean, but it ships with fewer things, so there's more to plug in yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MicroK8s&lt;/strong&gt; has the convenience of enabling add-ons with a single &lt;code&gt;microk8s enable&lt;/code&gt; line, but in return it's tied to snap, and there are reported cases of dqlite CPU/consensus instability on write-heavy clusters. (&lt;a href="https://github.com/canonical/microk8s/issues/3227" rel="noopener noreferrer"&gt;GitHub Issue #3227&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k3s&lt;/strong&gt;, on the other hand, has essential components bundled into a single binary, so the initial setup is the fastest, and the path of moving to embedded etcd with multiple servers fits naturally with this kind of "cloud + home HA." Add low-spec/ARM support and the depth of its docs and community, and for the goal of learning and low-cost operation at once, k3s fit best. (comparison sources: &lt;a href="https://palark.com/blog/small-local-kubernetes-comparison/" rel="noopener noreferrer"&gt;Palark&lt;/a&gt; · &lt;a href="https://www.portainer.io/blog/k0s-vs-k3s" rel="noopener noreferrer"&gt;Portainer&lt;/a&gt; · &lt;a href="https://www.nops.io/blog/k0s-vs-k3s-vs-k8s/" rel="noopener noreferrer"&gt;nOps&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;k3s repackages upstream Kubernetes &lt;strong&gt;as a single binary (under 100MB) while staying 100% compatible (CNCF certified).&lt;/strong&gt; Its requirements are essentially just a modern kernel + cgroups, so it's no strain even on low-spec hardware. (&lt;a href="https://docs.k3s.io/" rel="noopener noreferrer"&gt;What is K3s&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Three reasons it's light:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single binary, single process.&lt;/strong&gt; Components that run separately in regular Kubernetes — &lt;code&gt;kube-apiserver&lt;/code&gt;, &lt;code&gt;kube-scheduler&lt;/code&gt;, &lt;code&gt;kube-controller-manager&lt;/code&gt;, &lt;code&gt;kubelet&lt;/code&gt;, &lt;code&gt;kube-proxy&lt;/code&gt; — are wrapped into one &lt;code&gt;k3s&lt;/code&gt; process, with the containerd runtime built in. (&lt;a href="https://docs.k3s.io/architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible datastore.&lt;/strong&gt; A single server uses SQLite by default; &lt;strong&gt;for multiple servers, embedded etcd is enabled by starting the first server with &lt;code&gt;--cluster-init&lt;/code&gt;&lt;/strong&gt; (external MySQL/Postgres are also possible). (&lt;a href="https://docs.k3s.io/datastore" rel="noopener noreferrer"&gt;Datastore&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Essential components included.&lt;/strong&gt; flannel (CNI), CoreDNS, Traefik (Ingress), ServiceLB, local-path (storage), and metrics-server are brought up together at install time. That's that much less to assemble yourself.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a bonus, k3s nodes come in two kinds — &lt;strong&gt;server&lt;/strong&gt; (control plane + datastore) and &lt;strong&gt;agent&lt;/strong&gt; (workload only) — which made it a good match for a hybrid setup like "cloud = server, home = agent." You'll see this in the diagrams from chapter 4 onward.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The control plane — three is the rule, but a two-node challenge
&lt;/h2&gt;

&lt;p&gt;Originally I ran personal services in the cloud with &lt;strong&gt;Docker Compose&lt;/strong&gt;. &lt;strong&gt;The small instance handled the DB&lt;/strong&gt;, and &lt;strong&gt;the large instance handled several microservices.&lt;/strong&gt; Moving these two to Kubernetes, my first worry was the control plane.&lt;/p&gt;

&lt;p&gt;For Kubernetes to be stable, &lt;strong&gt;control-plane HA&lt;/strong&gt; is the baseline. k3s's embedded etcd can't accept writes unless it keeps a majority (quorum), and the official HA guide recommends &lt;strong&gt;3 or more servers (an odd number).&lt;/strong&gt; With &lt;code&gt;n&lt;/code&gt; nodes the quorum is &lt;code&gt;floor(n/2)+1&lt;/code&gt;, and the node count minus the quorum is how many node failures you can tolerate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;servers&lt;/th&gt;
&lt;th&gt;quorum&lt;/th&gt;
&lt;th&gt;failures tolerated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
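
&lt;p&gt;The arithmetic behind this table can be reproduced with a few lines of shell. This is only a sketch: the function names &lt;code&gt;quorum&lt;/code&gt; and &lt;code&gt;tolerated&lt;/code&gt; are made up for illustration and are not part of k3s or etcd.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Quorum for n etcd members: floor(n/2) + 1 (shell arithmetic truncates the division).
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while the rest still hold quorum.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "servers=$n quorum=$(quorum "$n") tolerated=$(tolerated "$n")"
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Going from one server to two raises the quorum without adding any tolerance, which is why an even member count buys nothing over the odd count just below it.&lt;/p&gt;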

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp6z85q97u2f2x04wjuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp6z85q97u2f2x04wjuh.png" alt="etcd quorum — 2 vs 3" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rule is &lt;strong&gt;three.&lt;/strong&gt; But adding one more instance was tight on the wallet, so I changed the goal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I know three is the right answer, but for now let me run &lt;strong&gt;two as stably as possible.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In choosing two, I made two things clear.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;First, don't pile everything on one node.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I once put the control plane and services all on a single node and got badly burned. Lightsail is a &lt;strong&gt;burstable CPU&lt;/strong&gt; model: each plan has a per-vCPU baseline %, and when load stays above it for a while it &lt;strong&gt;spends the burst capacity&lt;/strong&gt; it had accrued, dropping to baseline once it hits 0. With the control plane (apiserver, etcd) on the same node, the moment the CPU dries up, cluster control itself stops — so I split the load across two nodes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;node&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;baseline&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;server-A&lt;/td&gt;
&lt;td&gt;8GB ($44/mo)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;cluster-init · control-plane+etcd+worker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;server-B&lt;/td&gt;
&lt;td&gt;16GB ($84/mo)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;join · control-plane+etcd+worker&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Checking usage at the time of writing, both are below baseline (the sustainable zone), accruing burst (&lt;code&gt;kubectl top nodes&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cp-8gb-init    482m         24%    4565Mi          58%
cp-16gb-join   1153m        28%    10096Mi         65%

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10esz88zeqtk5z7pub9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10esz88zeqtk5z7pub9c.png" alt="Lightsail burst CPU" width="799" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Second, admit that two is not HA, and take out insurance.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As the table shows, with two nodes, losing even one loses quorum and writes stop (pods already running keep going under kubelet, so it's "no changes" rather than "total outage"). I cover that risk with &lt;strong&gt;etcd automatic snapshots.&lt;/strong&gt; Since I gave no extra config, it runs with k3s defaults — &lt;code&gt;0 */12 * * *&lt;/code&gt; (twice a day), keep 5, stored at &lt;code&gt;/var/lib/rancher/k3s/server/db/snapshots&lt;/code&gt;. (&lt;a href="https://docs.k3s.io/cli/etcd-snapshot" rel="noopener noreferrer"&gt;etcd-snapshot&lt;/a&gt;) Since they only pile up locally, pushing them to NAS/object storage later is a task I've left for the backup installment.&lt;/p&gt;
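
&lt;p&gt;Alongside the scheduled snapshots, it can be handy to take one by hand before a risky change and to confirm what is actually on disk. A minimal sketch, assuming it runs on a server node with sudo; the snapshot name &lt;code&gt;pre-change&lt;/code&gt; is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Take an on-demand snapshot (the name is an arbitrary example).
sudo k3s etcd-snapshot save --name pre-change

# List the snapshots k3s knows about.
sudo k3s etcd-snapshot ls

# Inspect the default snapshot directory directly.
sudo ls -lh /var/lib/rancher/k3s/server/db/snapshots

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;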

&lt;h2&gt;
  
  
  4. Today's star — Tailscale
&lt;/h2&gt;

&lt;p&gt;The control plane is on Lightsail in Tokyo; the machine I'll use as a worker is the home iMac in Sapporo.&lt;/p&gt;

&lt;p&gt;These two &lt;strong&gt;don't share a private network.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The home machine sits behind a router on a private IP (192.168.x), so it can't be reached directly from outside, and opening ports to expose it would put cluster ports like kubelet (10250) and VXLAN (8472) on the open internet, which is dangerous. For k3s to bind nodes into one cluster, every node has to be able to reach every other node at &lt;strong&gt;one stable address&lt;/strong&gt;, and the current setup doesn't have that.&lt;/p&gt;

&lt;p&gt;So I went looking for a method among VPNs and meshes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;For this situation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct port exposure + public IP&lt;/td&gt;
&lt;td&gt;Expose as-is without a VPN&lt;/td&gt;
&lt;td&gt;Effectively exposes kubelet/VXLAN to the internet → dangerous, dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;raw WireGuard&lt;/td&gt;
&lt;td&gt;Fast kernel VPN, manual keys/peers&lt;/td&gt;
&lt;td&gt;Fast, but NAT traversal, key management, and access control are all manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVPN&lt;/td&gt;
&lt;td&gt;Traditional hub-style VPN&lt;/td&gt;
&lt;td&gt;Hub-centric rather than mesh, heavy to set up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroTier&lt;/td&gt;
&lt;td&gt;Managed mesh VPN&lt;/td&gt;
&lt;td&gt;A solid candidate, similar in flavor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tailscale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;WireGuard + coordination (mesh)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic NAT traversal, ACLs, MagicDNS, unattended keys, free for personal use ← &lt;strong&gt;chosen&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headscale&lt;/td&gt;
&lt;td&gt;Self-hosted Tailscale control server&lt;/td&gt;
&lt;td&gt;More freedom but the burden of self-operation → consider later&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;After a lot of trial and deliberation, I chose Tailscale.&lt;/strong&gt; It's a WireGuard-based mesh VPN: install a daemon on each machine and log in, and it joins a private network (a &lt;strong&gt;tailnet&lt;/strong&gt;) tied to your account, with each machine getting one address in the &lt;code&gt;100.x&lt;/code&gt; range. That address stays the same wherever the machine is, whether in Tokyo or behind a router in Sapporo, and Tailscale handles NAT traversal for you.&lt;/p&gt;

&lt;p&gt;It means you can lay down a "virtual LAN" that puts the cloud and home on one plane. (And up to 100 machines register for free.)&lt;/p&gt;

&lt;p&gt;When k3s registers a node, it stamps the address given via &lt;code&gt;--node-ip&lt;/code&gt; as that node's identity (InternalIP). So by setting this value to a Tailscale address from the start, a home node joining later lands on the same &lt;code&gt;100.x&lt;/code&gt; plane as-is. That's why I install &lt;strong&gt;Tailscale before k3s.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tailscale: sign up · install · verify
&lt;/h2&gt;

&lt;p&gt;The order is sign up → install → verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;① Sign up.&lt;/strong&gt; Log in at &lt;a href="https://login.tailscale.com" rel="noopener noreferrer"&gt;login.tailscale.com&lt;/a&gt; with an &lt;strong&gt;SSO account&lt;/strong&gt; like Google, GitHub, or Microsoft, and a tailnet for that account is created automatically. There's no separate signup form; SSO is the signup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-signin-wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-signin-wm.png" alt="Tailscale sign-in screen" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;② (For servers) Prepare an auth key.&lt;/strong&gt; Cloud servers have no browser, so issue an &lt;strong&gt;auth key&lt;/strong&gt; (&lt;code&gt;tskey-…&lt;/code&gt;) in advance from the admin console under &lt;strong&gt;Settings → Keys.&lt;/strong&gt; You can skip this if you'll connect interactively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-keys-wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-keys-wm.png" alt="Tailscale admin console Keys" width="508" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;③ Install &amp;amp; connect.&lt;/strong&gt; On each of the two cloud nodes (&lt;strong&gt;Amazon Linux 2023&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://tailscale.com/install.sh | sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;tailscale up &lt;span class="c"&gt;# authenticate via the printed URL (headless: --authkey tskey-… )&lt;/span&gt;
tailscale ip &lt;span class="nt"&gt;-4&lt;/span&gt; &lt;span class="c"&gt;# this node's 100.x address — used directly as --node-ip in ch.6&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
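
&lt;p&gt;For the headless case mentioned in the comment above, a sketch might look like the following. &lt;code&gt;TS_AUTHKEY&lt;/code&gt; and &lt;code&gt;NODE_NAME&lt;/code&gt; are placeholder variables introduced here for illustration; the key itself is the one issued in step ②.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Placeholders: use the real key from step 2 and pick your own hostname.
TS_AUTHKEY="tskey-…"
NODE_NAME="cp-8gb-init"

# Join the tailnet without a browser.
sudo tailscale up --authkey "$TS_AUTHKEY" --hostname "$NODE_NAME"

# Capture this node's Tailscale IPv4 address for later use as --node-ip.
NODE_IP="$(tailscale ip -4)"
echo "node-ip: $NODE_IP"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;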



&lt;p&gt;&lt;strong&gt;④ Verify.&lt;/strong&gt; If both nodes appear in the admin console &lt;strong&gt;Machines&lt;/strong&gt; page (&lt;a href="https://login.tailscale.com/admin/machines" rel="noopener noreferrer"&gt;login.tailscale.com/admin/machines&lt;/a&gt;) with their &lt;code&gt;100.x&lt;/code&gt; address and hostname, it worked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-machines-wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.seon.world%2Fimages%2Fk3s-1%2Fk3s-1-tailscale-machines-wm.png" alt="Tailscale admin console Machines list" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also check from the node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tailscale status &lt;span class="c"&gt;# list of machines in the tailnet + each one's 100.x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
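
&lt;p&gt;To confirm that the two nodes can actually reach each other over the tailnet, and not just that they are registered, one option is &lt;code&gt;tailscale ping&lt;/code&gt;. In this sketch &lt;code&gt;PEER_IP&lt;/code&gt; is a placeholder for the other node's &lt;code&gt;100.x&lt;/code&gt; address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Placeholder: the other node's Tailscale address, as printed by "tailscale ip -4" on that node.
PEER_IP="100.99.x.x"

# Ping over the tailnet; the reply indicates whether the path is direct or relayed via DERP.
tailscale ping "$PEER_IP"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;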



&lt;p&gt;With this, the two cloud nodes see each other by &lt;code&gt;100.x&lt;/code&gt; in one tailnet. Now I bring up k3s with these addresses. (&lt;a href="https://tailscale.com/kb/1031/install-linux" rel="noopener noreferrer"&gt;Tailscale Linux install&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Installing k3s (with Tailscale addresses)
&lt;/h2&gt;

&lt;p&gt;Put the &lt;code&gt;100.x&lt;/code&gt; you got in chapter 5 straight into &lt;code&gt;--node-ip&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rgfw8pnxvyige17i6nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rgfw8pnxvyige17i6nj.png" alt="Bootstrap &amp;amp; join flow" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;server-A (8GB)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.k3s.io | &lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;shared-secret&amp;gt; &lt;span class="nv"&gt;INSTALL_K3S_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v1.34.3+k3s1 &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-s&lt;/span&gt; - server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster-init&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-ip&lt;/span&gt; 100.71.x.x &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-external-ip&lt;/span&gt; &amp;lt;publicA&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--advertise-address&lt;/span&gt; 100.71.x.x &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flannel-backend&lt;/span&gt; vxlan

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--cluster-init&lt;/code&gt;: initializes embedded etcd as the first server. (&lt;a href="https://docs.k3s.io/cli/server" rel="noopener noreferrer"&gt;server flags&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--node-ip 100.71.x.x&lt;/code&gt;: advertises the Tailscale address received in ch.5 as the InternalIP.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--node-external-ip&lt;/code&gt; / &lt;code&gt;--advertise-address&lt;/code&gt;: the node's public IP (for external exposure) and the address the apiserver advertises to the cluster (the Tailscale address), respectively.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--flannel-backend vxlan&lt;/code&gt;: CNI backend (the default, stated explicitly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;K3S_TOKEN&lt;/code&gt; can be a value you set yourself, like choosing a password, or left unset so that k3s generates one automatically. Either way you need the value to join more nodes, so save it somewhere safe or read it back from the path below on server-A.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/var/lib/rancher/k3s/server/node-token&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
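
&lt;p&gt;If you let k3s generate the token, carrying it over to the joining node could look like this. It is a sketch that assumes bash on both machines, with the first command run on server-A and the printed value pasted on server-B by hand.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On server-A: print the token k3s wrote at install time.
sudo cat /var/lib/rancher/k3s/server/node-token

# On server-B: read the pasted value without echoing it, so it stays out of shell history.
read -r -s -p "K3S_TOKEN: " K3S_TOKEN
export K3S_TOKEN

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;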

&lt;p&gt;&lt;strong&gt;server-B (16GB) — joins as the second server.&lt;/strong&gt; This node, too, joins the tailnet first, then just connects with the same token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.k3s.io | &lt;span class="nv"&gt;K3S_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;shared-secret&amp;gt; &lt;span class="nv"&gt;INSTALL_K3S_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v1.34.3+k3s1 &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-s&lt;/span&gt; - server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--server&lt;/span&gt; https://172.26.x.x:6443 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-ip&lt;/span&gt; 100.99.x.x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--server https://172.26.x.x:6443&lt;/code&gt; = server-A's address (a private IP, since it's the same VPC).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--node-ip 100.99.x.x&lt;/code&gt; = this node's Tailscale address.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two Lightsail boxes are in the &lt;strong&gt;same AWS VPC&lt;/strong&gt;, so the join itself used the private IP, but the InternalIP advertised to the cluster is Tailscale (&lt;code&gt;100.x&lt;/code&gt;) for both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firewall&lt;/strong&gt; — open only the minimum externally. (&lt;a href="https://docs.k3s.io/installation/requirements" rel="noopener noreferrer"&gt;requirements&lt;/a&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;port&lt;/th&gt;
&lt;th&gt;use&lt;/th&gt;
&lt;th&gt;exposure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;80 / 443&lt;/td&gt;
&lt;td&gt;Traefik Ingress&lt;/td&gt;
&lt;td&gt;all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;SSH&lt;/td&gt;
&lt;td&gt;my IP only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6443 / 2379-2380 / 8472 / 10250&lt;/td&gt;
&lt;td&gt;apiserver·etcd·flannel·kubelet&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;closed publicly&lt;/strong&gt;, private/Tailscale internal only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
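
&lt;p&gt;One way to sanity-check this table is to probe the TCP ports from a machine that is outside both the VPC and the tailnet. A rough sketch, assuming &lt;code&gt;nc&lt;/code&gt; (netcat) is installed there; &lt;code&gt;PUBLIC_IP&lt;/code&gt; is a placeholder for a node's public address. Port 8472 is UDP and is left out, since a connect-style probe says little about UDP.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Placeholder: the public IP of the node to probe.
PUBLIC_IP="52.x.x.x"

# Cluster-internal TCP ports that should NOT answer from the internet.
for port in 6443 2379 2380 10250; do
  if nc -z -w 3 "$PUBLIC_IP" "$port"; then
    echo "port $port is reachable publicly (unexpected)"
  else
    echo "port $port is closed or filtered (expected)"
  fi
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;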

&lt;h2&gt;
  
  
  7. Cluster setup — complete with two nodes
&lt;/h2&gt;

&lt;p&gt;Attaching the home iMac as an agent is covered in the next article.&lt;/p&gt;

&lt;p&gt;For now I've built the cluster with &lt;strong&gt;two Lightsail boxes, Tailscale applied.&lt;/strong&gt; Listing the nodes, you can confirm both are &lt;code&gt;Ready&lt;/code&gt; on the same version and runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide

NAME          STATUS   ROLES                AGE    VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
…3-146&lt;span class="o"&gt;(&lt;/span&gt;8GB&lt;span class="o"&gt;)&lt;/span&gt;   Ready    control-plane,etcd   139d   v1.34.3+k3s1   100.71.x.x    52.x.x.x      Amazon Linux 2023.7.20250512   6.1.134-…amzn2023.x86_64   containerd://2.1.5-k3s1
…2-70&lt;span class="o"&gt;(&lt;/span&gt;16GB&lt;span class="o"&gt;)&lt;/span&gt;   Ready    control-plane,etcd   139d   v1.34.3+k3s1   100.99.x.x    3.x.x.x       Amazon Linux 2023.9.20251105   6.1.156-…amzn2023.x86_64   containerd://2.1.5-k3s1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether the two nodes are etcd voting members (look at Conditions in &lt;code&gt;kubectl describe node &amp;lt;name&amp;gt;&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conditions:
  Type             Status   Reason                       Message
  ----             ------   ------                       -------
  EtcdIsVoter      True     MemberNotLearner             Node is a voting member of the etcd cluster
  MemoryPressure   False    KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False    KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False    KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True     KubeletReady                 kubelet is posting ready status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
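
&lt;p&gt;To pull out just the &lt;code&gt;EtcdIsVoter&lt;/code&gt; condition for every node instead of reading the full &lt;code&gt;describe&lt;/code&gt; output, a jsonpath query is one option. A sketch, assuming &lt;code&gt;kubectl&lt;/code&gt; is already pointed at this cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print each node together with the status of its EtcdIsVoter condition.
for node in $(kubectl get nodes -o name); do
  voter="$(kubectl get "$node" -o jsonpath='{.status.conditions[?(@.type=="EtcdIsVoter")].status}')"
  echo "$node EtcdIsVoter=$voter"
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;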



&lt;p&gt;Check that the k3s default bundle came up too (&lt;code&gt;kubectl get pods -n kube-system&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# kubectl get pods -n kube-system → k3s default bundle only (excerpt)&lt;/span&gt;
coredns-7f496c8d7d-nx9jc                  1/1   Running     139d   &lt;span class="c"&gt;# DNS&lt;/span&gt;
local-path-provisioner-578895bd58-mgxpm   1/1   Running     139d   &lt;span class="c"&gt;# local storage (default SC)&lt;/span&gt;
metrics-server-7b9c9c4b9c-76ldg           1/1   Running     139d   &lt;span class="c"&gt;# metrics (kubectl top)&lt;/span&gt;
traefik-78df465dcc-66kn8                  1/1   Running     9d     &lt;span class="c"&gt;# Ingress (server-A)&lt;/span&gt;
traefik-78df465dcc-gs4q7                  1/1   Running     8d     &lt;span class="c"&gt;# Ingress (server-B) → one per node = 2 replicas&lt;/span&gt;
helm-install-traefik-crd-pmk4t            0/1   Completed   139d   &lt;span class="c"&gt;# Helm Job that installed the bundle (completed)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That concludes setting up two cloud instances as a k3s cluster. Beyond installing k3s, I also put Tailscale in place so that any machine can later join as an agent, wherever it is and whatever form it takes, as long as k3s can run on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Next
&lt;/h2&gt;

&lt;p&gt;The AWS Lightsail nodes now form a cluster, and the groundwork for more nodes to join is in place.&lt;/p&gt;

&lt;p&gt;In the end it came down to one command per node, but this stage took more time than I expected.&lt;/p&gt;

&lt;p&gt;Next, I'll &lt;strong&gt;bring the iMac that has been sitting idle at home into this two-node cluster, in earnest.&lt;/strong&gt; I'll install Lima VMs on the iMac, set up a k3s agent on each, join them to the same tailnet, and write up the problems I ran into after joining, along with how I solved them.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;k3s — What is K3s / Architecture / Datastore: &lt;a href="https://docs.k3s.io/" rel="noopener noreferrer"&gt;https://docs.k3s.io/&lt;/a&gt; · /architecture · /datastore&lt;/li&gt;
&lt;li&gt;k3s — HA Embedded etcd / Server flags / etcd-snapshot / Requirements: &lt;a href="https://docs.k3s.io/datastore/ha-embedded" rel="noopener noreferrer"&gt;https://docs.k3s.io/datastore/ha-embedded&lt;/a&gt; · /cli/server · /cli/etcd-snapshot · /installation/requirements&lt;/li&gt;
&lt;li&gt;Lightweight distro comparison (k3s·k0s·MicroK8s): &lt;a href="https://palark.com/blog/small-local-kubernetes-comparison/" rel="noopener noreferrer"&gt;https://palark.com/blog/small-local-kubernetes-comparison/&lt;/a&gt; · &lt;a href="https://www.portainer.io/blog/k0s-vs-k3s" rel="noopener noreferrer"&gt;https://www.portainer.io/blog/k0s-vs-k3s&lt;/a&gt; · &lt;a href="https://www.nops.io/blog/k0s-vs-k3s-vs-k8s/" rel="noopener noreferrer"&gt;https://www.nops.io/blog/k0s-vs-k3s-vs-k8s/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tailscale — Linux install: &lt;a href="https://tailscale.com/kb/1031/install-linux" rel="noopener noreferrer"&gt;https://tailscale.com/kb/1031/install-linux&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Lightsail — burst CPU / baseline: &lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/baseline-cpu-performance.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/lightsail/latest/userguide/baseline-cpu-performance.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>k3s</category>
      <category>tailscale</category>
      <category>etcd</category>
    </item>
  </channel>
</rss>
