<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishara Ekanayaka</title>
    <description>The latest articles on DEV Community by Ishara Ekanayaka (@ishara_ekanayaka).</description>
    <link>https://dev.to/ishara_ekanayaka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908604%2F25c50d23-2349-471a-8b4c-1dfc9e9dbf4f.jpg</url>
      <title>DEV Community: Ishara Ekanayaka</title>
      <link>https://dev.to/ishara_ekanayaka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishara_ekanayaka"/>
    <language>en</language>
    <item>
      <title>Design and Implementation of a Slurm-Based HPC Cluster</title>
      <dc:creator>Ishara Ekanayaka</dc:creator>
      <pubDate>Mon, 01 Jun 2026 16:34:09 +0000</pubDate>
      <link>https://dev.to/ishara_ekanayaka/design-and-implementation-of-a-slurm-based-hpc-cluster-4e2g</link>
      <guid>https://dev.to/ishara_ekanayaka/design-and-implementation-of-a-slurm-based-hpc-cluster-4e2g</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
Managing a growing fleet of GPU and HPC servers one-by-one doesn't scale - here's how we fixed it.&lt;/p&gt;

&lt;p&gt;Our department has several computing resources, including GPU servers and HPC servers. Previously, these machines were managed and accessed individually. As the number of servers and users grew, so did the inefficiencies.&lt;/p&gt;

&lt;p&gt;From an administrator's perspective, it was difficult to monitor resource usage and ensure fair utilization across machines. Some servers would become heavily loaded while others sat idle, with no automatic balancing in place.&lt;/p&gt;

&lt;p&gt;From a user's perspective, researchers and students had to manually decide which server to use - without any visibility into current availability or load.&lt;/p&gt;

&lt;p&gt;To address these challenges, we built a Slurm-based HPC cluster. Users now submit jobs with their resource requirements, and the scheduler automatically selects an appropriate compute node. This simplifies resource management and allows the department's computing infrastructure to be utilized far more efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Slurm?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Slurm (originally an acronym for Simple Linux Utility for Resource Management) is an open-source workload manager widely used in HPC environments. It handles job queuing, scheduling, and resource allocation across a cluster of machines, letting users focus on their work rather than infrastructure logistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwgjdef5ewkerqs99g0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwgjdef5ewkerqs99g0d.png" alt="SLURM Architecture" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cluster consists of three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Login node&lt;/strong&gt; — the entry point where users connect and submit jobs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controller node (CERF)&lt;/strong&gt; — runs &lt;code&gt;slurmctld&lt;/code&gt;, the central Slurm daemon responsible for scheduling decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute nodes&lt;/strong&gt; — &lt;code&gt;kepler&lt;/code&gt; (GPU node) and &lt;code&gt;aiken&lt;/code&gt; (HPC node), each running &lt;code&gt;slurmd&lt;/code&gt; to receive and execute assigned jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users submit jobs to the controller, which schedules and dispatches them to the appropriate compute node based on resource availability and job requirements.&lt;/p&gt;

&lt;p&gt;Communication between Slurm components is secured using &lt;strong&gt;Munge authentication&lt;/strong&gt;. Munge creates and validates credentials with a secret key shared by all cluster nodes, so each Slurm daemon can verify that a request came from a trusted node. User authentication is handled separately through the standard Linux user management system.&lt;/p&gt;

&lt;p&gt;A Slurm partition groups the available compute resources, and the controller maintains real-time information about each node's state — allowing it to make informed scheduling decisions.&lt;/p&gt;
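
&lt;p&gt;To make the scheduling idea concrete, here is a toy Python sketch of matching a job's requirements against the free resources of each node. It is not how Slurm's scheduler is implemented: the real one also handles priorities, backfill, and much more. The node capacities and the "keep the GPU node free when possible" tie-break are assumptions made up for illustration; only the node names &lt;code&gt;kepler&lt;/code&gt; and &lt;code&gt;aiken&lt;/code&gt; come from our cluster.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    free_gpus: int


@dataclass
class Job:
    cpus: int
    gpus: int = 0


def pick_node(job, nodes):
    """Return the name of a node that can run the job, or None if none fits."""
    candidates = [
        node for node in nodes
        if node.free_cpus >= job.cpus and node.free_gpus >= job.gpus
    ]
    if not candidates:
        return None  # in a real cluster the job would wait in the queue
    # Illustrative tie-break: prefer the node with the fewest free GPUs,
    # so CPU-only jobs do not occupy the GPU node unnecessarily.
    best = min(candidates, key=lambda node: (node.free_gpus, node.free_cpus))
    return best.name


# Hypothetical capacities, chosen only for this example.
nodes = [Node("kepler", free_cpus=32, free_gpus=2),
         Node("aiken", free_cpus=64, free_gpus=0)]

gpu_choice = pick_node(Job(cpus=8, gpus=1), nodes)
cpu_choice = pick_node(Job(cpus=8), nodes)
too_big = pick_node(Job(cpus=1000), nodes)
```

&lt;p&gt;With these made-up numbers, the GPU job can only land on &lt;code&gt;kepler&lt;/code&gt;, the CPU-only job goes to &lt;code&gt;aiken&lt;/code&gt;, and the oversized job gets no node at all.&lt;/p&gt;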

&lt;p&gt;&lt;strong&gt;Submitting Jobs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the cluster operational, users submit workloads through standard Slurm commands. Instead of SSH-ing directly into a compute node, they interact with the cluster through the login node using commands like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;srun&lt;/code&gt; — run a command interactively on an allocated node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sbatch&lt;/code&gt; — submit a batch script to be executed asynchronously&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;squeue&lt;/code&gt; — view the current job queue and job statuses&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scancel&lt;/code&gt; — cancel a running or pending job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each job request is sent to the Slurm controller, which identifies a suitable compute node and dispatches the job for execution.&lt;/p&gt;
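
&lt;p&gt;For batch work, the resource requirements travel with the script as &lt;code&gt;#SBATCH&lt;/code&gt; directives. The Python helper below is a minimal sketch that assembles such a script as text. The function name, the partition name &lt;code&gt;main&lt;/code&gt;, and the default values are assumptions for illustration, not our actual configuration, and &lt;code&gt;--gres=gpu:N&lt;/code&gt; only works on clusters where GPUs are configured as a generic resource.&lt;/p&gt;

```python
def build_sbatch_script(job_name, command, partition="main",
                        cpus=4, gpus=0, time_limit="01:00:00"):
    """Return the text of a batch script that could be passed to sbatch."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # hypothetical partition name
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --output={job_name}-%j.out",  # %j expands to the job ID
    ]
    if gpus:
        # Request GPUs only when the job needs them.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command]
    return "\n".join(lines) + "\n"


gpu_script = build_sbatch_script("train", "python train.py", gpus=1)
cpu_script = build_sbatch_script("hello", "hostname")
```

&lt;p&gt;Saving &lt;code&gt;gpu_script&lt;/code&gt; to a file and running &lt;code&gt;sbatch&lt;/code&gt; on it from the login node would queue the job; &lt;code&gt;squeue&lt;/code&gt; then shows where it ended up.&lt;/p&gt;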

&lt;p&gt;The figure below shows a simple &lt;code&gt;srun&lt;/code&gt; command being submitted from the login node, scheduled by the controller, and executed on the &lt;code&gt;kepler&lt;/code&gt; GPU node — with the output returned directly to the user's terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lix0z4vkr7cmbj96ld5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lix0z4vkr7cmbj96ld5.png" alt="Example" width="792" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>distributedsystems</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Building an on-prem Kubernetes cluster manager - Part 1: Why, and what it looks like</title>
      <dc:creator>Ishara Ekanayaka</dc:creator>
      <pubDate>Sat, 02 May 2026 09:33:50 +0000</pubDate>
      <link>https://dev.to/ishara_ekanayaka/building-an-on-prem-kubernetes-cluster-manager-part-1-why-and-what-it-looks-like-39k</link>
      <guid>https://dev.to/ishara_ekanayaka/building-an-on-prem-kubernetes-cluster-manager-part-1-why-and-what-it-looks-like-39k</guid>
      <description>&lt;p&gt;I started &lt;a href="https://github.com/IsharaEkanayaka/kubesmith" rel="noopener noreferrer"&gt;Kubesmith&lt;/a&gt; as a learning project: I wanted to understand the full stack under a managed Kubernetes offering by building one myself, on hardware I could touch. The scope grew into something that could plausibly run as an internal tool for a university department - self-service Kubernetes clusters for student projects and research groups, on the department's own Proxmox hosts, without going through IT for every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem, concretely&lt;/strong&gt;&lt;br&gt;
The requirements I designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-prem only. No cloud. A university department has hardware and a hypervisor; it doesn't have a GKE budget.&lt;/li&gt;
&lt;li&gt;Multi-tenant. Several research groups or courses need their own clusters, isolated from each other.&lt;/li&gt;
&lt;li&gt;Self-service. A student or researcher should be able to provision a cluster without a sysadmin in the loop.&lt;/li&gt;
&lt;li&gt;RBAC. Not everyone who can see a cluster should be able to destroy it. A course instructor and a first-week student need different permissions.&lt;/li&gt;
&lt;li&gt;Fully automated from zero. Template → VMs → Kubernetes → reachable kubeconfig, with no manual step in the middle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Self-service" is the word that pulls the whole design together. It's the reason this couldn't just be a folder of Terraform and an Ansible playbook. That works for me, sitting at a terminal. It doesn't work for a student who just wants a cluster for a distributed systems assignment and doesn't care about HCL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tool choices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypervisor&lt;/strong&gt;: Proxmox VE. Using bare-metal PCs as nodes doesn't scale - you're limited by the number of physical boxes and you can't tear down and rebuild in seconds. A hypervisor gives you elasticity on the hardware you already own. Proxmox is open source, has a solid API, and runs on commodity hardware, which is the realistic picture of what a department would have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VM provisioning&lt;/strong&gt;: Terraform with the bpg/proxmox provider. The bpg provider is actively maintained and covers the Proxmox API surface I needed (cloud-init, clones, static IPs). Terraform's declarative model is a good fit for "I want N VMs with these specs."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Golden image&lt;/strong&gt;: Packer. Terraform clones VMs from a template. Something has to build that template. Packer scripts an Ubuntu 22.04 autoinstall with cloud-init baked in, so Terraform can stamp out nodes configured at clone time. No GUI - these are server nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster configuration&lt;/strong&gt;: Ansible. Once VMs exist, you still need to turn them into a Kubernetes cluster: containerd, kubeadm, CNI, join tokens. Ansible's strength is "take a fresh box and converge it to a state," which is exactly this problem. Three roles - common, control_plane, worker - map 1:1 to the conceptual stages.&lt;/p&gt;
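
&lt;p&gt;To show how the roles line up with the VMs, here is a small Python sketch that renders an INI-style Ansible inventory from a list of node IPs. Treat it as an illustration rather than Kubesmith's actual code: the function name, the host aliases, and the choice to make the first IP the control plane are assumptions, and the group names are simply borrowed from the role names above.&lt;/p&gt;

```python
def render_inventory(ips):
    """Render an INI inventory: first IP is the control plane, the rest are workers."""
    if not ips:
        raise ValueError("at least one IP is required")
    control_plane, workers = ips[0], ips[1:]
    lines = [
        "[control_plane]",
        f"cp-1 ansible_host={control_plane}",  # hypothetical host alias
        "",
        "[worker]",
    ]
    lines += [
        f"worker-{index} ansible_host={ip}"
        for index, ip in enumerate(workers, start=1)
    ]
    return "\n".join(lines) + "\n"


inventory = render_inventory(["10.40.19.201", "10.40.19.202", "10.40.19.203"])
```

&lt;p&gt;A playbook could then apply common to every host and the other two roles to their matching groups.&lt;/p&gt;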

&lt;p&gt;&lt;strong&gt;API &amp;amp; UI&lt;/strong&gt;: FastAPI + dashboard. This is the self-service layer. Without it, the project is a bundle of IaC scripts. With it, it's a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsmu9db55glkoq6o5haf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsmu9db55glkoq6o5haf.png" alt=" " width="611" height="593"&gt;&lt;/a&gt;&lt;br&gt;
A web dashboard sends HTTP requests to a FastAPI REST API, which drives three tools: Terraform (VM lifecycle) calling the Proxmox VE API, Ansible (Kubernetes setup) reaching the VMs over SSH, and Paramiko/SSH running kubectl on the control plane. The VMs host both the control plane and the worker nodes of the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;A detail worth pointing out: only Terraform talks to the Proxmox API. Ansible and Paramiko both talk directly to the VMs over SSH. That split is deliberate - Terraform owns infrastructure lifecycle, Ansible owns configuration inside a running machine, and the API server needs a way to run kubectl against a cluster without shipping kubeconfigs around. I'll dig into each of these in later posts.&lt;/p&gt;

&lt;p&gt;Two more details worth calling out now, because they come back later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Per-cluster Terraform workspaces. Each cluster gets its own &lt;code&gt;workspaces/&amp;lt;cluster_id&amp;gt;/&lt;/code&gt; directory with its own state file and tfvars. That's what makes it safe for multiple clusters to coexist - no shared state, no stepping on each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IP allocation from a pool. For each cluster, the API reserves a contiguous block of IPs from a pool that starts at 10.40.19.201, writes the block into the generated tfvars, and Terraform injects the addresses via cloud-init. This is the piece that makes "click a button" actually work without human IP planning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
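
&lt;p&gt;The two details above fit together roughly like this. The Python below is a simplified sketch, not the real allocator: the pool's upper bound, the class and function names, the tfvars variable names, and the &lt;code&gt;terraform.tfvars&lt;/code&gt; file name are all assumptions, and released blocks are never reclaimed. Only the starting address and the &lt;code&gt;workspaces/&amp;lt;cluster_id&amp;gt;/&lt;/code&gt; layout come from the design described here.&lt;/p&gt;

```python
import ipaddress
import json
from pathlib import PurePosixPath

POOL_START = "10.40.19.201"  # starting address mentioned above
POOL_END = "10.40.19.254"    # assumed upper bound of the pool


class IpPool:
    """Hands out contiguous, non-overlapping IP blocks, one per cluster."""

    def __init__(self, start=POOL_START, end=POOL_END):
        self._next = ipaddress.IPv4Address(start)
        self._end = ipaddress.IPv4Address(end)
        self.blocks = {}

    def reserve(self, cluster_id, node_count):
        if cluster_id in self.blocks:
            return self.blocks[cluster_id]  # repeated requests get the same block
        if not node_count >= 1:
            raise ValueError("node_count must be at least 1")
        last = self._next + (node_count - 1)
        if last > self._end:
            raise RuntimeError("IP pool exhausted")
        block = [str(self._next + offset) for offset in range(node_count)]
        self.blocks[cluster_id] = block
        self._next = last + 1
        return block


def render_tfvars(cluster_id, ips):
    """Return the per-cluster tfvars path and its contents (names are hypothetical)."""
    path = PurePosixPath("workspaces") / cluster_id / "terraform.tfvars"
    body = f'cluster_id = "{cluster_id}"\nnode_ips   = {json.dumps(ips)}\n'
    return path, body


pool = IpPool()
alpha_ips = pool.reserve("alpha", 3)
beta_ips = pool.reserve("beta", 2)
beta_path, beta_tfvars = render_tfvars("beta", beta_ips)
```

&lt;p&gt;Because every cluster writes into its own directory and draws from a block nobody else holds, two provisioning requests can run side by side without sharing state or addresses.&lt;/p&gt;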

&lt;p&gt;&lt;strong&gt;What's coming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The rest of the series goes depth-first on each layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Part 2: Immutable infra - Packer and Terraform on Proxmox. Building the golden image, the bpg provider, static IPs via cloud-init.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 3: Ansible - turning VMs into a cluster. kubeadm init, Flannel CNI, the join-token dance between control plane and workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 4: The API layer. FastAPI, async job orchestration, per-cluster workspaces, the IP allocator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 5: Multi-tenancy and RBAC. Sessions vs. API keys, the role hierarchy, resource-level permissions, and why kubectl-over-SSH was the right call for namespace operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
