<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishara Ekanayaka</title>
    <description>The latest articles on DEV Community by Ishara Ekanayaka (@ishara_ekanayaka).</description>
    <link>https://dev.to/ishara_ekanayaka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908604%2F25c50d23-2349-471a-8b4c-1dfc9e9dbf4f.jpg</url>
      <title>DEV Community: Ishara Ekanayaka</title>
      <link>https://dev.to/ishara_ekanayaka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishara_ekanayaka"/>
    <language>en</language>
    <item>
      <title>Building an on-prem Kubernetes cluster manager - Part 1: Why, and what it looks like</title>
      <dc:creator>Ishara Ekanayaka</dc:creator>
      <pubDate>Sat, 02 May 2026 09:33:50 +0000</pubDate>
      <link>https://dev.to/ishara_ekanayaka/building-an-on-prem-kubernetes-cluster-manager-part-1-why-and-what-it-looks-like-39k</link>
      <guid>https://dev.to/ishara_ekanayaka/building-an-on-prem-kubernetes-cluster-manager-part-1-why-and-what-it-looks-like-39k</guid>
      <description>&lt;p&gt;I started &lt;a href="https://github.com/IsharaEkanayaka/kubesmith" rel="noopener noreferrer"&gt;Kubesmith&lt;/a&gt; as a learning project: I wanted to understand the full stack under a managed Kubernetes offering by building one myself, on hardware I could touch. The scope grew into something that could plausibly run as an internal tool for a university department - self-service Kubernetes clusters for student projects and research groups, on the department's own Proxmox hosts, without going through IT for every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem, concretely&lt;/strong&gt;&lt;br&gt;
The requirements I designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-prem only. No cloud. A university department has hardware and a hypervisor; it doesn't have a GKE budget.&lt;/li&gt;
&lt;li&gt;Multi-tenant. Several research groups or courses need their own clusters, isolated from each other.&lt;/li&gt;
&lt;li&gt;Self-service. A student or researcher should be able to provision a cluster without a sysadmin in the loop.&lt;/li&gt;
&lt;li&gt;RBAC. Not everyone who can see a cluster should be able to destroy it. A course instructor and a first-week student need different permissions.&lt;/li&gt;
&lt;li&gt;Fully automated from zero. Template → VMs → Kubernetes → reachable kubeconfig, with no manual step in the middle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Self-service" is the word that pulls the whole design together. It's the reason this couldn't just be a folder of Terraform and an Ansible playbook. That works for me, sitting at a terminal. It doesn't work for a student who just wants a cluster for a distributed systems assignment and doesn't care about HCL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tool choices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypervisor&lt;/strong&gt;: Proxmox VE. Using bare-metal PCs as nodes doesn't scale - you're limited by physical boxes and you can't tear down and rebuild in seconds. A hypervisor gives you elasticity on the hardware you already own. Proxmox is open source, has a solid API, and runs on commodity hardware, which is the realistic picture of what a department would have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VM provisioning&lt;/strong&gt;: Terraform with the bpg/proxmox provider. The bpg provider is actively maintained and covers the Proxmox API surface I needed (cloud-init, clones, static IPs). Terraform's declarative model is a good fit for "I want N VMs with these specs."&lt;/p&gt;
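
&lt;p&gt;To make "I want N VMs with these specs" concrete, here is a minimal sketch of how an orchestration layer can drive Terraform non-interactively for one cluster. The directory layout, file names and variable names are illustrative, not Kubesmith's actual code; &lt;code&gt;-chdir&lt;/code&gt; and &lt;code&gt;-auto-approve&lt;/code&gt; are standard Terraform CLI options.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import subprocess
from pathlib import Path

def provision_cluster(cluster_id: str, node_count: int, base_dir: str = "workspaces") -&amp;gt; None:
    """Illustrative sketch: run Terraform against one per-cluster workspace."""
    workspace = Path(base_dir) / cluster_id        # e.g. workspaces/demo-cluster/
    workspace.mkdir(parents=True, exist_ok=True)   # in reality this also holds the cluster's .tf files

    # Generated variables; terraform.tfvars.json is auto-loaded from the working directory.
    tfvars = workspace / "terraform.tfvars.json"
    tfvars.write_text(json.dumps({"cluster_id": cluster_id, "node_count": node_count}))

    # Standard Terraform invocations, scoped to this cluster's workspace directory.
    subprocess.run(["terraform", f"-chdir={workspace}", "init", "-input=false"], check=True)
    subprocess.run(["terraform", f"-chdir={workspace}", "apply", "-auto-approve"], check=True)
&lt;/code&gt;&lt;/pre&gt;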

&lt;p&gt;&lt;strong&gt;Golden image&lt;/strong&gt;: Packer. Terraform clones VMs from a template. Something has to build that template. Packer scripts an Ubuntu 22.04 autoinstall with cloud-init baked in, so Terraform can stamp out nodes configured at clone time. No GUI - these are server nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster configuration&lt;/strong&gt;: Ansible. Once VMs exist, you still need to turn them into a Kubernetes cluster: containerd, kubeadm, CNI, join tokens. Ansible's strength is "take a fresh box and converge it to a state," which is exactly this problem. Three roles - common, control_plane, worker - map 1:1 to the conceptual stages.&lt;/p&gt;
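
&lt;p&gt;As a rough sketch of that hand-off (the group names, file names and &lt;code&gt;site.yml&lt;/code&gt; playbook are placeholders, not Kubesmith's real layout), the orchestrator only has to render an inventory from the IPs Terraform just assigned and invoke &lt;code&gt;ansible-playbook&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import subprocess
from pathlib import Path

def configure_cluster(cluster_dir: Path, control_plane_ip: str, worker_ips: list[str]) -&amp;gt; None:
    """Illustrative sketch: render an INI inventory and apply the playbook over SSH."""
    workers = "\n".join(worker_ips)
    inventory = cluster_dir / "inventory.ini"
    inventory.write_text(f"[control_plane]\n{control_plane_ip}\n\n[workers]\n{workers}\n")

    # site.yml stands in for a playbook applying the common, control_plane
    # and worker roles in that order.
    subprocess.run(["ansible-playbook", "-i", str(inventory), "site.yml"], check=True)
&lt;/code&gt;&lt;/pre&gt;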

&lt;p&gt;&lt;strong&gt;API &amp;amp; UI&lt;/strong&gt;: FastAPI + dashboard. This is the self-service layer. Without it, the project is a bundle of IaC scripts. With it, it's a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsmu9db55glkoq6o5haf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsmu9db55glkoq6o5haf.png" alt=" " width="611" height="593"&gt;&lt;/a&gt;&lt;br&gt;
The web dashboard sends HTTP requests to a FastAPI REST API, which drives three tools - Terraform (VM lifecycle) calling the Proxmox VE API, Ansible (Kubernetes setup) reaching VMs over SSH, and Paramiko/SSH running kubectl on the control plane. The VMs host both the control plane and worker nodes of the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;A detail worth pointing out: only Terraform talks to the Proxmox API. Ansible and Paramiko both talk directly to the VMs over SSH. That split is deliberate - Terraform owns infrastructure lifecycle, Ansible owns configuration inside a running machine, and the API server needs a way to run kubectl against a cluster without shipping kubeconfigs around. I'll dig into each of these in later posts.&lt;/p&gt;
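
&lt;p&gt;Before those posts, here is the rough shape of the kubectl-over-SSH piece as a minimal Paramiko sketch (the host, username and key path are placeholders). The point is that the API server never holds a cluster's kubeconfig; it only reaches the control plane, where kubectl is already configured.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import paramiko

def run_kubectl(control_plane_ip: str, command: str) -&amp;gt; str:
    """Illustrative sketch: execute a kubectl command on the control plane over SSH."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Placeholder credentials; the real service would manage keys per cluster.
    client.connect(control_plane_ip, username="ubuntu", key_filename="/path/to/id_ed25519")
    try:
        _stdin, stdout, stderr = client.exec_command(f"kubectl {command}")
        output = stdout.read().decode()
        if stdout.channel.recv_exit_status() != 0:
            raise RuntimeError(stderr.read().decode())
        return output
    finally:
        client.close()

# Example: run_kubectl("10.40.19.201", "get nodes -o wide")
&lt;/code&gt;&lt;/pre&gt;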

&lt;p&gt;Two more details worth calling out now, because they come back later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Per-cluster Terraform workspaces. Each cluster gets its own &lt;code&gt;workspaces/&amp;lt;cluster_id&amp;gt;/&lt;/code&gt; directory with its own state file and tfvars. That's what makes it safe for multiple clusters to coexist - no shared state, no stepping on each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IP allocation from a pool. The API reserves a contiguous block of IPs from 10.40.19.201 onward per cluster, writes it into the generated tfvars, and Terraform injects it via cloud-init. This is the piece that makes "click a button" actually work without human IP planning; a rough sketch of the allocator follows this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
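
&lt;p&gt;Here is the kind of allocator the second bullet implies, as a rough sketch under simplified assumptions: the pool's upper bound is invented for illustration, and the real service has to persist reservations (for example in its database) rather than keep them in memory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import ipaddress

POOL_START = ipaddress.IPv4Address("10.40.19.201")
POOL_END = ipaddress.IPv4Address("10.40.19.250")   # assumed upper bound, for illustration only

_allocated: set[ipaddress.IPv4Address] = set()      # the real API persists reservations

def reserve_block(size: int) -&amp;gt; list[str]:
    """Reserve the first contiguous run of free addresses of the requested size."""
    pool = [POOL_START + i for i in range(int(POOL_END) - int(POOL_START) + 1)]
    for start in range(len(pool) - size + 1):
        block = pool[start:start + size]
        if not any(ip in _allocated for ip in block):
            _allocated.update(block)
            return [str(ip) for ip in block]
    raise RuntimeError("IP pool exhausted")

# One control plane plus two workers needs three addresses:
# reserve_block(3) returns ['10.40.19.201', '10.40.19.202', '10.40.19.203']
&lt;/code&gt;&lt;/pre&gt;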

&lt;p&gt;&lt;strong&gt;What's coming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The rest of the series goes depth-first on each layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Part 2: Immutable infra - Packer and Terraform on Proxmox. Building the golden image, the bpg provider, static IPs via cloud-init.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 3: Ansible - turning VMs into a cluster. kubeadm init, Flannel CNI, the join-token dance between control plane and workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 4: The API layer. FastAPI, async job orchestration, per-cluster workspaces, the IP allocator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 5: Multi-tenancy and RBAC. Sessions vs. API keys, the role hierarchy, resource-level permissions, and why kubectl-over-SSH was the right call for namespace operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
