Building an on-prem Kubernetes cluster manager - Part 1: Why, and what it looks like

Ishara Ekanayaka — Sat, 02 May 2026 09:33:50 +0000

I started Kubesmith as a learning project: I wanted to understand the full stack under a managed Kubernetes offering by building one myself, on hardware I could touch. The scope grew into something that could plausibly run as an internal tool for a university department - self-service Kubernetes clusters for student projects and research groups, on the department's own Proxmox hosts, without going through IT for every request.

The problem, concretely
The requirements I designed for:

On-prem only. No cloud. A university department has hardware and a hypervisor; it doesn't have a GKE budget.
Multi-tenant. Several research groups or courses need their own clusters, isolated from each other.
Self-service. A student or researcher should be able to provision a cluster without a sysadmin in the loop.
RBAC. Not everyone who can see a cluster should be able to destroy it. A course instructor and a first-week student need different permissions.
Fully automated from zero. Template → VMs → Kubernetes → reachable kubeconfig, with no manual step in the middle.

"Self-service" is the word that pulls the whole design together. It's the reason this couldn't just be a folder of Terraform and an Ansible playbook. That works for me, sitting at a terminal. It doesn't work for a student who just wants a cluster for a distributed systems assignment and doesn't care about HCL.

The tool choices

Hypervisor: Proxmox VE. Using bare-metal PCs as nodes doesn't scale - you're limited by the number of physical boxes, and you can't tear down and rebuild in seconds. A hypervisor gives you elasticity on the hardware you already own. Proxmox is open source, has a solid API, and runs on commodity hardware, which is the realistic picture of what a department would have.

VM provisioning: Terraform with the bpg/proxmox provider. The bpg provider is actively maintained and covers the Proxmox API surface I needed (cloud-init, clones, static IPs). Terraform's declarative model is a good fit for "I want N VMs with these specs."

Golden image: Packer. Terraform clones VMs from a template. Something has to build that template. Packer scripts an Ubuntu 22.04 autoinstall with cloud-init baked in, so Terraform can stamp out nodes configured at clone time. No GUI - these are server nodes.

Cluster configuration: Ansible. Once VMs exist, you still need to turn them into a Kubernetes cluster: containerd, kubeadm, CNI, join tokens. Ansible's strength is "take a fresh box and converge it to a state," which is exactly this problem. Three roles - common, control_plane, worker - map 1:1 to the conceptual stages.
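To make the hand-off from "VMs exist" to "Ansible converges them" concrete, here is a minimal sketch of rendering an INI-style inventory from a list of node IPs. The group names (control_plane, workers), the default SSH user, and the function name are assumptions for illustration, not taken from the project; ansible_user is a standard Ansible inventory variable.

```python
from typing import List


def render_inventory(
    control_plane_ips: List[str],
    worker_ips: List[str],
    ssh_user: str = "ubuntu",  # assumed default user on the golden image
) -> str:
    """Render an INI-style Ansible inventory with two host groups."""
    if not control_plane_ips:
        raise ValueError("at least one control-plane node is required")

    lines = ["[control_plane]"]  # hypothetical group name
    lines += [f"{ip} ansible_user={ssh_user}" for ip in control_plane_ips]
    lines += ["", "[workers]"]  # hypothetical group name
    lines += [f"{ip} ansible_user={ssh_user}" for ip in worker_ips]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_inventory(["10.40.19.201"], ["10.40.19.202", "10.40.19.203"]))
```

A playbook could then apply the common role to both groups and the control_plane and worker roles to their respective groups.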

API & UI: FastAPI + dashboard. This is the self-service layer. Without it, the project is a bundle of IaC scripts. With it, it's a product.

The architecture

The web dashboard sends HTTP requests to a FastAPI REST API, which drives three tools: Terraform (VM lifecycle), which calls the Proxmox VE API; Ansible (Kubernetes setup), which reaches the VMs over SSH; and Paramiko/SSH, which runs kubectl on the control plane. The VMs host both the control plane and the worker nodes of the Kubernetes cluster.
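As a rough illustration of the order in which the API could drive these tools, the sketch below builds the command lines for one provisioning run. It assumes each cluster has a directory that holds both the Terraform configuration and a generated inventory.ini, and that the playbook is called site.yml; those names, and the function names, are hypothetical. The flags themselves (-chdir, -input=false, -auto-approve, and ansible-playbook -i) are standard CLI options.

```python
import subprocess
from pathlib import Path
from typing import List


def provisioning_commands(cluster_id: str, root: Path = Path("workspaces")) -> List[List[str]]:
    """Return the commands for one provisioning run, in execution order."""
    ws = root / cluster_id  # assumed per-cluster directory layout
    return [
        # 1. Terraform creates the VMs through the Proxmox API.
        ["terraform", f"-chdir={ws}", "init", "-input=false"],
        ["terraform", f"-chdir={ws}", "apply", "-auto-approve", "-input=false"],
        # 2. Ansible configures Kubernetes on the VMs over SSH.
        ["ansible-playbook", "-i", str(ws / "inventory.ini"), "site.yml"],
    ]


def provision(cluster_id: str) -> None:
    """Run each step and stop at the first failure (check=True raises)."""
    for cmd in provisioning_commands(cluster_id):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in provisioning_commands("demo"):
        print(" ".join(cmd))
```

In a real API server this would run as a background job rather than inside the request handler, since a full apply plus playbook run takes minutes.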

A detail worth pointing out: only Terraform talks to the Proxmox API. Ansible and Paramiko both talk directly to the VMs over SSH. That split is deliberate - Terraform owns infrastructure lifecycle, Ansible owns configuration inside a running machine, and the API server needs a way to run kubectl against a cluster without shipping kubeconfigs around. I'll dig into each of these in later posts.
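A minimal sketch of the kubectl-over-SSH idea, under stated assumptions: the helper names are hypothetical, the kubeconfig path is the one kubeadm writes on a control-plane node (the SSH user needs permission to read it), and the connection uses key-based auth with known host keys. Only standard Paramiko calls are used (SSHClient, load_system_host_keys, connect, exec_command).

```python
import shlex
from typing import List


def kubectl_command(args: List[str], kubeconfig: str = "/etc/kubernetes/admin.conf") -> str:
    """Build a shell-quoted kubectl command line to run on the remote host."""
    parts = ["kubectl", "--kubeconfig", kubeconfig, *args]
    return " ".join(shlex.quote(part) for part in parts)


def run_kubectl(client, args: List[str]) -> str:
    """Run kubectl through an already-connected SSH client and return stdout."""
    _stdin, stdout, stderr = client.exec_command(kubectl_command(args))
    # Drain both streams before asking for the exit status.
    out = stdout.read().decode()
    err = stderr.read().decode()
    exit_code = stdout.channel.recv_exit_status()
    if exit_code != 0:
        raise RuntimeError(f"kubectl failed ({exit_code}): {err.strip()}")
    return out


def connect(host: str, username: str, key_filename: str):
    """Open an SSH connection to the control-plane node."""
    import paramiko  # third-party dependency, imported lazily

    client = paramiko.SSHClient()
    client.load_system_host_keys()  # unknown hosts are rejected by default
    client.connect(hostname=host, username=username, key_filename=key_filename)
    return client
```

Usage would look like `run_kubectl(connect("10.40.19.201", "ubuntu", "/path/to/key"), ["get", "nodes"])`, where the host, user, and key path are placeholders. The kubeconfig never leaves the control-plane node; only command output travels back to the API server.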

Two more details worth calling out now, because they come back later:

Per-cluster Terraform workspaces. Each cluster gets its own workspaces/<cluster_id>/ directory with its own state file and tfvars. That's what makes it safe for multiple clusters to coexist - no shared state, no stepping on each other.
IP allocation from a pool. For each cluster, the API reserves a contiguous block of IPs from a pool that starts at 10.40.19.201, writes the block into the generated tfvars, and Terraform injects the addresses via cloud-init. This is the piece that makes "click a button" actually work without human IP planning.
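The two details above can be sketched together: a first-fit allocator that hands out a contiguous block from the pool, and a helper that writes the block into a per-cluster workspace directory. This is an illustration under assumptions, not the project's allocator: the pool's upper bound, the node_ips variable name, the function names, and the use of terraform.tfvars.json (a file name Terraform loads automatically) are all hypothetical choices.

```python
import ipaddress
import json
from pathlib import Path
from typing import Iterable, List

POOL_START = ipaddress.IPv4Address("10.40.19.201")  # from the post
POOL_END = ipaddress.IPv4Address("10.40.19.254")    # assumed upper bound


def allocate_block(reserved: Iterable[str], size: int) -> List[str]:
    """Return the first run of `size` free, contiguous addresses (first fit)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    taken = {int(ipaddress.IPv4Address(ip)) for ip in reserved}
    run: List[int] = []
    for n in range(int(POOL_START), int(POOL_END) + 1):
        if n in taken:
            run = []  # a reserved address breaks the contiguous run
            continue
        run.append(n)
        if len(run) == size:
            return [str(ipaddress.IPv4Address(x)) for x in run]
    raise RuntimeError("IP pool exhausted")


def write_workspace(root: str, cluster_id: str, ips: List[str]) -> Path:
    """Create workspaces/<cluster_id>/ and write the cluster's tfvars there."""
    ws = Path(root) / cluster_id
    ws.mkdir(parents=True, exist_ok=True)
    path = ws / "terraform.tfvars.json"
    path.write_text(json.dumps({"node_ips": ips}, indent=2) + "\n")
    return path


if __name__ == "__main__":
    block = allocate_block(reserved=[], size=3)
    print(write_workspace("workspaces", "demo", block), block)
```

A real allocator also needs the reservation to be atomic (for example, a database transaction), so two simultaneous requests can't be handed the same block.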

What's coming

The rest of the series goes depth-first on each layer:

Part 2: Immutable infra - Packer and Terraform on Proxmox. Building the golden image, the bpg provider, static IPs via cloud-init.
Part 3: Ansible - turning VMs into a cluster. kubeadm init, Flannel CNI, the join-token dance between control plane and workers.
Part 4: The API layer. FastAPI, async job orchestration, per-cluster workspaces, the IP allocator.
Part 5: Multi-tenancy and RBAC. Sessions vs. API keys, the role hierarchy, resource-level permissions, and why kubectl-over-SSH was the right call for namespace operations.

DEV Community: Ishara Ekanayaka
