<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Zubair Bin Akbar</title>
    <description>The latest articles on DEV Community by Muhammad Zubair Bin Akbar (@zubairakbar).</description>
    <link>https://dev.to/zubairakbar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874077%2F58ef1a6a-88ea-4af9-925d-f9a18ea98939.jpeg</url>
      <title>DEV Community: Muhammad Zubair Bin Akbar</title>
      <link>https://dev.to/zubairakbar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zubairakbar"/>
    <language>en</language>
    <item>
      <title>CPU Pinning and Affinity in HPC: Why Performance Changes Drastically</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sat, 16 May 2026 18:58:33 +0000</pubDate>
      <link>https://dev.to/zubairakbar/cpu-pinning-and-affinity-in-hpc-why-performance-changes-drastically-5f5j</link>
      <guid>https://dev.to/zubairakbar/cpu-pinning-and-affinity-in-hpc-why-performance-changes-drastically-5f5j</guid>
      <description>&lt;p&gt;In HPC environments, users often notice something confusing:&lt;/p&gt;

&lt;p&gt;The same application, same input, and same number of CPUs can produce very different performance results across runs.&lt;/p&gt;

&lt;p&gt;One of the biggest reasons behind this is CPU pinning and CPU affinity.&lt;/p&gt;

&lt;p&gt;Without proper CPU placement, processes can bounce between cores, compete for cache, and suffer from NUMA penalties. In large parallel workloads, this can drastically reduce performance.&lt;/p&gt;

&lt;p&gt;This blog explains what CPU pinning and affinity are, why they matter in HPC, and how they impact real workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is CPU Affinity?
&lt;/h2&gt;

&lt;p&gt;CPU affinity controls which CPU cores a process or thread is allowed to run on.&lt;/p&gt;

&lt;p&gt;The operating system scheduler can still move the process between the allowed cores, but only within that defined CPU set.&lt;/p&gt;

&lt;h3&gt;
  
  
  For example:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A process may be allowed to run only on cores 0 to 7&lt;/li&gt;
&lt;li&gt;The scheduler can move it between those cores if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Affinity helps improve cache locality and reduces unnecessary movement across the entire system.&lt;/p&gt;
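
&lt;p&gt;A minimal sketch of setting and inspecting affinity from the shell with &lt;code&gt;taskset&lt;/code&gt; (part of util-linux). The binary name &lt;code&gt;./app&lt;/code&gt; is a placeholder, and the core range assumes the node has at least eight cores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start ./app (placeholder) so it may only run on cores 0 to 7
taskset -c 0-7 ./app

# Print the allowed CPU list of the current shell
taskset -cp $$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;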




&lt;h2&gt;
  
  
  What Is CPU Pinning?
&lt;/h2&gt;

&lt;p&gt;CPU pinning is the strictest form of affinity: a process or thread is bound to one specific CPU core, so the scheduler cannot migrate it elsewhere.&lt;/p&gt;

&lt;p&gt;In HPC clusters, schedulers like Slurm often handle this automatically through CPU binding options.&lt;/p&gt;

&lt;h3&gt;
  
  
  For example:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;MPI rank 0 stays on core 0&lt;/li&gt;
&lt;li&gt;MPI rank 1 stays on core 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This minimizes CPU migrations and provides more predictable performance for HPC workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pinning ensures:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better cache locality&lt;/li&gt;
&lt;li&gt;Reduced scheduler overhead&lt;/li&gt;
&lt;li&gt;Predictable performance&lt;/li&gt;
&lt;li&gt;Lower NUMA latency&lt;/li&gt;
&lt;li&gt;Reduced context switching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without pinning, Linux may move tasks between cores frequently depending on system activity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Performance Changes So Much
&lt;/h2&gt;

&lt;p&gt;Modern HPC nodes are complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  A single node may contain:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multiple CPU sockets&lt;/li&gt;
&lt;li&gt;NUMA regions&lt;/li&gt;
&lt;li&gt;Shared and private caches&lt;/li&gt;
&lt;li&gt;Hyperthreading&lt;/li&gt;
&lt;li&gt;Hundreds of logical CPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When processes move randomly between CPUs, several problems appear.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cache Locality Problems
&lt;/h2&gt;

&lt;p&gt;CPUs rely heavily on cache memory.&lt;/p&gt;

&lt;p&gt;If a thread keeps running on the same core, cached data remains available and execution becomes faster.&lt;/p&gt;

&lt;p&gt;When the thread migrates to another core:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache must be rebuilt&lt;/li&gt;
&lt;li&gt;Memory access latency increases&lt;/li&gt;
&lt;li&gt;CPU cycles are wasted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes extremely expensive for tightly coupled MPI applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  NUMA Effects
&lt;/h2&gt;

&lt;p&gt;NUMA stands for Non-Uniform Memory Access.&lt;/p&gt;

&lt;p&gt;In multi socket systems, memory attached to the local CPU socket is faster than memory attached to another socket.&lt;/p&gt;

&lt;p&gt;If a process runs on Socket 0 but accesses memory allocated on Socket 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory latency increases&lt;/li&gt;
&lt;li&gt;Bandwidth decreases&lt;/li&gt;
&lt;li&gt;Application performance drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most common reasons HPC jobs scale poorly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example of Bad CPU Placement
&lt;/h2&gt;

&lt;p&gt;Consider a dual socket server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Socket 0 → cores 0 to 31&lt;/li&gt;
&lt;li&gt;Socket 1 → cores 32 to 63&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an MPI application launches ranks without proper affinity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank 0 may start on core 2&lt;/li&gt;
&lt;li&gt;Later move to core 40&lt;/li&gt;
&lt;li&gt;Then back to core 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the application suffers from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remote memory access&lt;/li&gt;
&lt;li&gt;Cache misses&lt;/li&gt;
&lt;li&gt;CPU migration overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result can be a major slowdown even though CPU usage appears high.&lt;/p&gt;




&lt;h2&gt;
  
  
  MPI and CPU Binding
&lt;/h2&gt;

&lt;p&gt;MPI applications are very sensitive to where processes run on the CPU.&lt;/p&gt;

&lt;p&gt;If MPI ranks keep moving between cores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache data gets lost&lt;/li&gt;
&lt;li&gt;Memory access becomes slower&lt;/li&gt;
&lt;li&gt;Communication latency increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid this, MPI runtimes and schedulers use CPU binding or pinning.&lt;/p&gt;

&lt;h3&gt;
  
  
  For example with Open MPI:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mpirun &lt;span class="nt"&gt;--bind-to&lt;/span&gt; core &lt;span class="nt"&gt;--map-by&lt;/span&gt; socket ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With Slurm:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;srun &lt;span class="nt"&gt;--cpu-bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cores ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These settings keep MPI processes fixed to specific CPU cores, which usually provides more stable and faster performance in HPC workloads.&lt;/p&gt;
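
&lt;p&gt;Both launchers can also report the binding they applied, which is a quick way to confirm the placement. A sketch, assuming Open MPI's &lt;code&gt;mpirun&lt;/code&gt; and a Slurm installation with CPU binding enabled; &lt;code&gt;./app&lt;/code&gt; is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Open MPI: print the binding of every rank at launch
mpirun --report-bindings --bind-to core --map-by socket ./app

# Slurm: print the CPU mask chosen for each task
srun --cpu-bind=verbose,cores ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;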




&lt;h2&gt;
  
  
  Hyperthreading Can Also Matter
&lt;/h2&gt;

&lt;p&gt;Some workloads perform poorly when pinned to logical CPUs instead of physical cores.&lt;/p&gt;

&lt;p&gt;For compute intensive applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two threads sharing one physical core may compete for resources&lt;/li&gt;
&lt;li&gt;Floating point performance may decrease&lt;/li&gt;
&lt;li&gt;Memory bandwidth may become limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why many HPC sites disable hyperthreading for production workloads.&lt;/p&gt;
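
&lt;p&gt;Where hyperthreading stays enabled, Slurm can be asked to place only one task per physical core instead. A sketch of a job script; the task count is illustrative and &lt;code&gt;./app&lt;/code&gt; is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
#SBATCH --ntasks=32
# Use one hardware thread per physical core
#SBATCH --hint=nomultithread

srun --cpu-bind=cores ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;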




&lt;h2&gt;
  
  
  Real World Performance Difference
&lt;/h2&gt;

&lt;p&gt;In many HPC benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proper CPU affinity is often reported to improve performance by roughly 10% to 40%, depending on the application and node topology&lt;/li&gt;
&lt;li&gt;NUMA aware placement can reduce latency significantly&lt;/li&gt;
&lt;li&gt;Communication heavy MPI jobs benefit the most&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applications such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CFD solvers&lt;/li&gt;
&lt;li&gt;Molecular dynamics&lt;/li&gt;
&lt;li&gt;Finite element simulations&lt;/li&gt;
&lt;li&gt;AI training workloads&lt;/li&gt;
&lt;li&gt;Weather modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are highly sensitive to CPU placement.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Check CPU Affinity
&lt;/h2&gt;

&lt;p&gt;Useful Linux tools include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;taskset &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;pid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;numactl &lt;span class="nt"&gt;--show&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lscpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hwloc-ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;hwloc&lt;/code&gt; package is especially useful for visualizing CPU topology and NUMA layout.&lt;/p&gt;
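
&lt;p&gt;To see where the threads of a running process are actually executing, &lt;code&gt;ps&lt;/code&gt; can print the processor each thread last ran on. A small sketch; the PID below is a hypothetical value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PID=12345   # hypothetical process ID, replace with a real one

# One line per thread: thread ID, CPU it last ran on (PSR), command name
ps -L -o tid,psr,comm -p "$PID"

# Allowed CPU list for the same process
taskset -cp "$PID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the &lt;code&gt;ps&lt;/code&gt; line a few times shows whether the PSR column changes, which indicates that threads are migrating.&lt;/p&gt;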




&lt;h2&gt;
  
  
  Best Practices in HPC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use Scheduler Managed Affinity
&lt;/h3&gt;

&lt;p&gt;Let the cluster scheduler manage CPU placement whenever possible.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --cpus-per-task=8&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --cpu-bind=cores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Keep MPI Ranks NUMA Aware
&lt;/h3&gt;

&lt;p&gt;Try to keep MPI ranks and memory allocations within the same NUMA domain.&lt;/p&gt;

&lt;p&gt;Tools like &lt;code&gt;numactl&lt;/code&gt; can help.&lt;/p&gt;
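
&lt;p&gt;A minimal sketch with &lt;code&gt;numactl&lt;/code&gt;, assuming NUMA node 0 exists on the machine and holds the cores you want; &lt;code&gt;./app&lt;/code&gt; is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List NUMA nodes, their CPUs, and their memory
numactl --hardware

# Run on the CPUs of NUMA node 0 and allocate memory only from node 0
numactl --cpunodebind=0 --membind=0 ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;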




&lt;h3&gt;
  
  
  3. Benchmark Different Configurations
&lt;/h3&gt;

&lt;p&gt;Different applications behave differently.&lt;/p&gt;

&lt;p&gt;Always test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core binding&lt;/li&gt;
&lt;li&gt;Socket binding&lt;/li&gt;
&lt;li&gt;NUMA placement&lt;/li&gt;
&lt;li&gt;Hyperthreading enabled vs disabled&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Monitor CPU Migrations
&lt;/h3&gt;

&lt;p&gt;High CPU migrations can indicate poor affinity configuration.&lt;/p&gt;

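&lt;p&gt;Useful commands (&lt;code&gt;pidstat -w&lt;/code&gt; reports context switches per task, and &lt;code&gt;perf stat&lt;/code&gt; can count CPU migrations directly):&lt;br&gt;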
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pidstat &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;perf &lt;span class="nb"&gt;stat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;CPU pinning and affinity are often overlooked in HPC environments, but they directly affect application scalability and runtime consistency.&lt;/p&gt;

&lt;p&gt;Two jobs using the same resources can perform very differently simply because of process placement. Understanding CPU topology, NUMA behavior, and scheduler affinity policies is essential for getting the best performance from modern HPC clusters.&lt;/p&gt;

&lt;p&gt;In many cases, properly placing and pinning processes to CPU cores can improve performance without upgrading the hardware.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>hpc</category>
      <category>slurm</category>
    </item>
    <item>
      <title>How Slurm Handles Resource Allocation Internally</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Wed, 13 May 2026 19:05:33 +0000</pubDate>
      <link>https://dev.to/zubairakbar/how-slurm-handles-resource-allocation-internally-3fj5</link>
      <guid>https://dev.to/zubairakbar/how-slurm-handles-resource-allocation-internally-3fj5</guid>
      <description>&lt;p&gt;If you work with HPC clusters, chances are you use Slurm every day to submit jobs, monitor queues, and manage compute resources.&lt;/p&gt;

&lt;p&gt;Most users know commands like sbatch, squeue, and sinfo, but fewer understand what actually happens internally when a job is submitted.&lt;/p&gt;

&lt;p&gt;This article explains how Slurm handles resource allocation behind the scenes, from job submission to execution on compute nodes.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When You Submit a Job?
&lt;/h2&gt;

&lt;p&gt;When a user runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sbatch job.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slurm begins a multi step workflow internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  The main components involved are:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;slurmctld → Central controller daemon&lt;/li&gt;
&lt;li&gt;slurmd → Compute node daemon&lt;/li&gt;
&lt;li&gt;slurmdbd → Accounting database daemon (optional but common)&lt;/li&gt;
&lt;li&gt;Scheduler plugin&lt;/li&gt;
&lt;li&gt;Select plugin&lt;/li&gt;
&lt;li&gt;Cgroups/task plugins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each component has a specific role in resource allocation.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Job Submission
&lt;/h2&gt;

&lt;p&gt;The sbatch command sends the job request to slurmctld.&lt;/p&gt;

&lt;h3&gt;
  
  
  The request includes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Number of nodes&lt;/li&gt;
&lt;li&gt;CPUs&lt;/li&gt;
&lt;li&gt;Memory&lt;/li&gt;
&lt;li&gt;GPUs&lt;/li&gt;
&lt;li&gt;Time limit&lt;/li&gt;
&lt;li&gt;Partition&lt;/li&gt;
&lt;li&gt;Constraints&lt;/li&gt;
&lt;li&gt;QoS&lt;/li&gt;
&lt;li&gt;Account information&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --nodes=2&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --ntasks-per-node=32&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --mem=128G&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --time=04:00:00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, Slurm creates a job record and places it into the pending queue.&lt;/p&gt;
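
&lt;p&gt;The job record can be inspected right after submission. A sketch, assuming a single cluster setup where &lt;code&gt;sbatch --parsable&lt;/code&gt; prints only the job ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Submit and capture the job ID
jobid=$(sbatch --parsable job.sh)

# Show the full job record held by slurmctld
scontrol show job "$jobid"

# Show job ID, state, and pending reason
squeue -j "$jobid" -o "%i %T %r"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;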

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Job Validation
&lt;/h2&gt;

&lt;p&gt;Before scheduling the job, Slurm validates several things internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  User &amp;amp; Account Checks
&lt;/h3&gt;

&lt;p&gt;Slurm verifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User permissions&lt;/li&gt;
&lt;li&gt;Account associations&lt;/li&gt;
&lt;li&gt;QoS limits&lt;/li&gt;
&lt;li&gt;Fairshare policies&lt;/li&gt;
&lt;li&gt;Partition access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If accounting is enabled, slurmdbd provides usage statistics and limits.&lt;/p&gt;
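
&lt;p&gt;The associations and fairshare data that these checks draw on can be viewed from the command line. A sketch, assuming accounting through slurmdbd is enabled; the format fields shown are only one possible selection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Accounts, partitions, and QoS the current user is associated with
sacctmgr show assoc user="$USER" format=Account,User,Partition,QOS

# Fairshare usage for the current user
sshare -u "$USER"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;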

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Scheduler Evaluation
&lt;/h2&gt;

&lt;p&gt;Now the scheduler starts evaluating the job.&lt;/p&gt;

&lt;p&gt;The default scheduler in Slurm is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sched/backfill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scheduler performs two important tasks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Main Scheduling Pass
&lt;/h3&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Available resources&lt;/li&gt;
&lt;li&gt;Job priority&lt;/li&gt;
&lt;li&gt;Node states&lt;/li&gt;
&lt;li&gt;Reservations&lt;/li&gt;
&lt;li&gt;Limits&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Backfill Scheduling
&lt;/h3&gt;

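&lt;p&gt;Backfill lets lower priority jobs start early, as long as doing so does not delay the expected start time of any higher priority job. It depends on job time limits, so realistic &lt;code&gt;--time&lt;/code&gt; values help.&lt;/p&gt;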

&lt;p&gt;This improves overall cluster utilization.&lt;/p&gt;
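
&lt;p&gt;Two standard commands give some visibility into what the scheduler is planning. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Expected start times of the current user's pending jobs
squeue --start -u "$USER"

# Scheduler statistics, including a section on backfill cycles
sdiag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;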

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  How Job Priority Is Calculated
&lt;/h2&gt;

&lt;p&gt;Slurm calculates a dynamic priority score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Factors include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fairshare usage&lt;/li&gt;
&lt;li&gt;Job age&lt;/li&gt;
&lt;li&gt;Job size&lt;/li&gt;
&lt;li&gt;Partition priority&lt;/li&gt;
&lt;li&gt;QoS priority&lt;/li&gt;
&lt;li&gt;Association priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, the priority plugin (&lt;code&gt;priority/multifactor&lt;/code&gt;) multiplies each factor by a configurable weight and adds the results into a single score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Priority = Age + Fairshare + JobSize + Partition + QoS&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Higher score means earlier scheduling.&lt;/p&gt;
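
&lt;p&gt;The per job factors and the weights configured on a cluster can be inspected as follows. A sketch, assuming the &lt;code&gt;priority/multifactor&lt;/code&gt; plugin is in use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Priority components of the current user's pending jobs
sprio -l -u "$USER"

# Weights from the running configuration
scontrol show config | grep -i "^PriorityWeight"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;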

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Resource Selection
&lt;/h2&gt;

&lt;p&gt;Once the scheduler decides to run the job, Slurm uses the select plugin.&lt;/p&gt;

&lt;h3&gt;
  
  
  Most clusters use:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select/cons_tres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This plugin handles consumable resources using TRES.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are TRES?
&lt;/h2&gt;

&lt;p&gt;TRES stands for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trackable RESources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Examples:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU&lt;/li&gt;
&lt;li&gt;Memory&lt;/li&gt;
&lt;li&gt;GPU&lt;/li&gt;
&lt;li&gt;Node&lt;/li&gt;
&lt;li&gt;License&lt;/li&gt;
&lt;li&gt;Burst buffer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This model allows Slurm to track resources very precisely.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal Node Selection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The select plugin now determines:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Which nodes are eligible&lt;/li&gt;
&lt;li&gt;How CPUs are distributed&lt;/li&gt;
&lt;li&gt;Memory allocation&lt;/li&gt;
&lt;li&gt;GPU placement&lt;/li&gt;
&lt;li&gt;Socket/core selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slurm checks node topology information stored in memory by slurmctld.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;NodeA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;64 CPUs&lt;/span&gt;
  &lt;span class="s"&gt;512 GB RAM&lt;/span&gt;
  &lt;span class="s"&gt;4 GPUs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  If the job requests:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;32 CPUs + 2 GPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slurm reserves exactly those resources internally.&lt;/p&gt;
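
&lt;p&gt;Expressed as a job script, the request above might look like this. A sketch, assuming GPUs are configured as a GRES named &lt;code&gt;gpu&lt;/code&gt;; &lt;code&gt;./app&lt;/code&gt; is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:2

# Pass the CPU count on explicitly so the step gets all 32 CPUs
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;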

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Resource Reservation
&lt;/h2&gt;

&lt;p&gt;After node selection, Slurm marks resources as allocated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internally:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPUs become unavailable to other jobs&lt;/li&gt;
&lt;li&gt;Memory counters are reduced&lt;/li&gt;
&lt;li&gt;GPUs are reserved&lt;/li&gt;
&lt;li&gt;Node state changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can observe this using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scontrol show node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;squeue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
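
&lt;p&gt;To narrow the node output to the allocation counters, it can be filtered. A sketch; &lt;code&gt;NodeA&lt;/code&gt; is the example node name used earlier, and the exact field names may differ between Slurm versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show state and allocated CPU, memory, and TRES counters for one node
scontrol show node NodeA | grep -E "State|CPUAlloc|AllocMem|AllocTRES"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;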



&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Launching the Job
&lt;/h2&gt;

&lt;p&gt;Now slurmctld contacts the slurmd daemon on allocated nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The compute node daemon performs:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Environment setup&lt;/li&gt;
&lt;li&gt;UID/GID validation&lt;/li&gt;
&lt;li&gt;Cgroup creation&lt;/li&gt;
&lt;li&gt;CPU binding&lt;/li&gt;
&lt;li&gt;Memory enforcement&lt;/li&gt;
&lt;li&gt;Task launching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  How Cgroups Enforce Limits
&lt;/h2&gt;

&lt;p&gt;Modern Slurm clusters heavily rely on Linux cgroups.&lt;/p&gt;

&lt;p&gt;Cgroups ensure a job cannot exceed allocated resources.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU Enforcement
&lt;/h3&gt;

&lt;p&gt;Only the allocated CPU cores are accessible to the job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Enforcement
&lt;/h3&gt;

&lt;p&gt;Memory usage beyond the limit triggers an OOM kill.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Isolation
&lt;/h3&gt;

&lt;p&gt;Only the assigned GPUs are visible.&lt;/p&gt;

&lt;p&gt;This is why users see:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CUDA_VISIBLE_DEVICES=0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;automatically set inside jobs.&lt;/p&gt;
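
&lt;p&gt;The effect is easy to check from inside a job step. A sketch, assuming cgroup based CPU confinement and a GRES named &lt;code&gt;gpu&lt;/code&gt; are configured; the resource counts are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;srun --ntasks=1 --cpus-per-task=4 --gres=gpu:1 bash -c '
  # CPUs this step is allowed to use
  grep Cpus_allowed_list /proc/self/status
  # GPU indices exposed to the step
  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;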

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU Binding and Affinity
&lt;/h2&gt;

&lt;p&gt;Slurm also handles CPU affinity internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  This improves:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NUMA locality&lt;/li&gt;
&lt;li&gt;Cache efficiency&lt;/li&gt;
&lt;li&gt;MPI performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;srun &lt;span class="nt"&gt;--cpu-bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally, the task plugin (typically &lt;code&gt;task/affinity&lt;/code&gt;) uses the socket and core layout of the node to map each task to specific CPU cores.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Job Execution
&lt;/h2&gt;

&lt;p&gt;Once everything is configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes start&lt;/li&gt;
&lt;li&gt;Accounting begins&lt;/li&gt;
&lt;li&gt;Usage metrics are collected&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Slurm tracks:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU time&lt;/li&gt;
&lt;li&gt;Memory usage&lt;/li&gt;
&lt;li&gt;GPU usage&lt;/li&gt;
&lt;li&gt;Energy consumption&lt;/li&gt;
&lt;li&gt;Exit codes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These statistics are later visible through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sacct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
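
&lt;p&gt;A sketch of a typical query; the job ID is hypothetical, and &lt;code&gt;ConsumedEnergy&lt;/code&gt; is only populated when an energy accounting plugin is configured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;JOBID=12345   # hypothetical job ID

sacct -j "$JOBID" --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,ConsumedEnergy,ExitCode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;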



&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When the Job Finishes?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  After completion:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Resources are released&lt;/li&gt;
&lt;li&gt;Node state is updated&lt;/li&gt;
&lt;li&gt;Accounting data is stored&lt;/li&gt;
&lt;li&gt;Scheduler reevaluates pending jobs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The released resources immediately become available for new allocations.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Understanding This Matters
&lt;/h2&gt;

&lt;p&gt;Knowing how Slurm allocates resources helps administrators and users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Troubleshoot pending jobs&lt;/li&gt;
&lt;li&gt;Optimize scheduling&lt;/li&gt;
&lt;li&gt;Improve cluster utilization&lt;/li&gt;
&lt;li&gt;Diagnose CPU or memory contention&lt;/li&gt;
&lt;li&gt;Tune fairshare policies&lt;/li&gt;
&lt;li&gt;Understand performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also makes debugging much easier when dealing with issues like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ReqNodeNotAvail&lt;br&gt;
Resources&lt;br&gt;
Priority&lt;br&gt;
QOSMaxCpuPerUserLimit&lt;/code&gt;&lt;/p&gt;
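
&lt;p&gt;These reason codes appear in the last column of the following query. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pending jobs of the current user: job ID, partition, name, reason
squeue -u "$USER" -t PENDING -o "%i %P %j %r"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;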

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Slurm does much more than simply queue jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internally, it performs:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Policy validation&lt;/li&gt;
&lt;li&gt;Priority calculations&lt;/li&gt;
&lt;li&gt;Topology aware scheduling&lt;/li&gt;
&lt;li&gt;Precise resource accounting&lt;/li&gt;
&lt;li&gt;Cgroup enforcement&lt;/li&gt;
&lt;li&gt;Distributed task launching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these internals gives HPC administrators better control over cluster performance and helps users write more efficient jobs.&lt;/p&gt;

&lt;p&gt;The next time you run sbatch, remember that an entire scheduling engine is working behind the scenes to decide exactly where and how your workload should run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hpc</category>
      <category>slurm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>InfiniBand vs Omni Path vs Ethernet for AI Workloads</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Mon, 11 May 2026 19:27:00 +0000</pubDate>
      <link>https://dev.to/zubairakbar/infiniband-vs-omni-path-vs-ethernet-for-ai-workloads-55dh</link>
      <guid>https://dev.to/zubairakbar/infiniband-vs-omni-path-vs-ethernet-for-ai-workloads-55dh</guid>
      <description>&lt;p&gt;AI workloads are pushing HPC and data center networks harder than ever. Training large language models, distributed deep learning, and high speed data pipelines depend heavily on fast interconnects between compute nodes.&lt;/p&gt;

&lt;p&gt;When GPUs spend more time waiting for data than processing it, the network becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Three major networking technologies are commonly discussed in AI and HPC environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;InfiniBand&lt;/li&gt;
&lt;li&gt;Intel Omni Path&lt;/li&gt;
&lt;li&gt;Ethernet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each comes with different strengths, trade offs, and real world use cases.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Network Fabric Matters in AI
&lt;/h2&gt;

&lt;p&gt;Modern AI training is rarely limited to a single GPU or node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed frameworks like:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch DDP&lt;/li&gt;
&lt;li&gt;DeepSpeed&lt;/li&gt;
&lt;li&gt;Horovod&lt;/li&gt;
&lt;li&gt;TensorFlow Distributed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;constantly exchange gradients, parameters, and synchronization data between nodes.&lt;/p&gt;

&lt;p&gt;The faster this communication happens, the better the training performance scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key factors include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Bandwidth&lt;/li&gt;
&lt;li&gt;RDMA support&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Congestion handling&lt;/li&gt;
&lt;li&gt;GPU communication efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  1. InfiniBand
&lt;/h2&gt;

&lt;p&gt;InfiniBand, today supplied mainly by NVIDIA (through its Mellanox acquisition), is widely considered the gold standard for high performance AI and HPC clusters.&lt;/p&gt;

&lt;p&gt;It is designed specifically for ultra low latency and extremely high throughput communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RDMA (Remote Direct Memory Access)&lt;/li&gt;
&lt;li&gt;GPUDirect RDMA support&lt;/li&gt;
&lt;li&gt;Very low latency&lt;/li&gt;
&lt;li&gt;High bandwidth (HDR at 200 Gb/s and NDR at 400 Gb/s per port)&lt;/li&gt;
&lt;li&gt;Adaptive routing&lt;/li&gt;
&lt;li&gt;Lossless communication&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why AI Clusters Love InfiniBand
&lt;/h3&gt;

&lt;p&gt;Large AI workloads generate massive all reduce traffic between GPUs.&lt;/p&gt;

&lt;p&gt;InfiniBand performs exceptionally well because it minimizes CPU involvement and allows direct GPU to GPU communication across nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  This improves:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi node GPU scaling&lt;/li&gt;
&lt;li&gt;Training efficiency&lt;/li&gt;
&lt;li&gt;Synchronization speed&lt;/li&gt;
&lt;li&gt;Cluster utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Large scale LLM training&lt;/li&gt;
&lt;li&gt;HPC supercomputers&lt;/li&gt;
&lt;li&gt;GPU heavy AI clusters&lt;/li&gt;
&lt;li&gt;Research environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expensive hardware&lt;/li&gt;
&lt;li&gt;Complex deployment&lt;/li&gt;
&lt;li&gt;Specialized networking expertise required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Omni Path
&lt;/h2&gt;

&lt;p&gt;Omni Path was Intel’s answer to InfiniBand for HPC environments.&lt;/p&gt;

&lt;p&gt;It focused on delivering high throughput with strong scalability at a potentially lower cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Low latency fabric&lt;/li&gt;
&lt;li&gt;High port density&lt;/li&gt;
&lt;li&gt;Efficient MPI communication&lt;/li&gt;
&lt;li&gt;Good scalability for HPC workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;p&gt;Omni Path performed well in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MPI based HPC clusters&lt;/li&gt;
&lt;li&gt;Scientific simulations&lt;/li&gt;
&lt;li&gt;CPU centric workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its 48 port switch ASIC, compared with 36 ports on InfiniBand switches of the same era, also reduced the number of switches needed in some deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges for AI Workloads
&lt;/h3&gt;

&lt;p&gt;While Omni Path worked well for traditional HPC, it struggled to gain traction in GPU dominated AI ecosystems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasons included:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited GPU ecosystem support&lt;/li&gt;
&lt;li&gt;Less mature GPUDirect integration&lt;/li&gt;
&lt;li&gt;Smaller vendor ecosystem&lt;/li&gt;
&lt;li&gt;Reduced industry adoption over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, most modern AI deployments lean toward InfiniBand or high speed Ethernet instead.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Ethernet
&lt;/h2&gt;

&lt;p&gt;Broadcom and other vendors continue pushing Ethernet into AI networking with higher speeds like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100GbE&lt;/li&gt;
&lt;li&gt;200GbE&lt;/li&gt;
&lt;li&gt;400GbE&lt;/li&gt;
&lt;li&gt;800GbE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ethernet remains the most widely deployed networking technology globally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy integration&lt;/li&gt;
&lt;li&gt;Lower cost&lt;/li&gt;
&lt;li&gt;Massive ecosystem support&lt;/li&gt;
&lt;li&gt;Simpler operations&lt;/li&gt;
&lt;li&gt;Familiar tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ethernet in Modern AI
&lt;/h3&gt;

&lt;p&gt;Traditional Ethernet had higher latency compared to InfiniBand, but newer technologies have improved performance significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RoCE (RDMA over Converged Ethernet)&lt;/li&gt;
&lt;li&gt;SmartNICs&lt;/li&gt;
&lt;li&gt;DPU acceleration&lt;/li&gt;
&lt;li&gt;Lossless Ethernet configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many organizations now run AI workloads successfully on high speed Ethernet fabrics.&lt;/p&gt;
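
&lt;p&gt;Whether the RDMA ports of a node run over InfiniBand or Ethernet (RoCE) can be checked from the host. A sketch, assuming the rdma-core utilities and iproute2 are installed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Per port details; the link_layer field reads InfiniBand or Ethernet
ibv_devinfo

# RDMA links and their state
rdma link show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;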

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cost effective scaling&lt;/li&gt;
&lt;li&gt;Easier maintenance&lt;/li&gt;
&lt;li&gt;Better compatibility with enterprise environments&lt;/li&gt;
&lt;li&gt;Flexible vendor choices&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Usually higher latency than InfiniBand&lt;/li&gt;
&lt;li&gt;Congestion tuning can become complex&lt;/li&gt;
&lt;li&gt;RoCE requires careful configuration, such as PFC and ECN tuning for lossless behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose InfiniBand if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You train large AI models&lt;/li&gt;
&lt;li&gt;You run multi node GPU clusters&lt;/li&gt;
&lt;li&gt;Maximum performance matters&lt;/li&gt;
&lt;li&gt;Budget is less of a concern&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Omni Path if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You already operate Intel HPC infrastructure&lt;/li&gt;
&lt;li&gt;Your workloads are MPI heavy&lt;/li&gt;
&lt;li&gt;GPU scaling is not the main priority&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Ethernet if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You want operational simplicity&lt;/li&gt;
&lt;li&gt;You need enterprise compatibility&lt;/li&gt;
&lt;li&gt;Budget matters&lt;/li&gt;
&lt;li&gt;Your AI workloads are medium scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;There is no universal winner.&lt;/p&gt;

&lt;h3&gt;
  
  
  The right interconnect depends on:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Workload type&lt;/li&gt;
&lt;li&gt;Cluster scale&lt;/li&gt;
&lt;li&gt;Budget&lt;/li&gt;
&lt;li&gt;GPU usage&lt;/li&gt;
&lt;li&gt;Operational expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For cutting-edge AI training, InfiniBand still dominates performance-focused deployments.&lt;/p&gt;

&lt;p&gt;For enterprise AI environments, Ethernet continues evolving rapidly and closing the gap.&lt;/p&gt;

&lt;p&gt;Omni-Path played an important role in HPC networking, but its presence in modern AI infrastructure is now much smaller than that of InfiniBand or Ethernet.&lt;/p&gt;

&lt;p&gt;As AI clusters continue growing, networking decisions are becoming just as important as CPU and GPU selection.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>networking</category>
      <category>hpc</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How HPC Clusters Accelerate AI/ML Training</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sat, 09 May 2026 21:36:44 +0000</pubDate>
      <link>https://dev.to/zubairakbar/how-hpc-clusters-accelerate-aiml-training-15a2</link>
      <guid>https://dev.to/zubairakbar/how-hpc-clusters-accelerate-aiml-training-15a2</guid>
      <description>&lt;p&gt;Artificial Intelligence and Machine Learning are growing faster than ever. From large language models to computer vision and scientific simulations, modern AI workloads require massive computing power.&lt;/p&gt;

&lt;p&gt;Training a model on a normal workstation can take days, weeks, or even months. This is where High Performance Computing, also known as HPC, becomes extremely valuable.&lt;/p&gt;

&lt;p&gt;An HPC cluster allows researchers, engineers, startups, and enterprises to train AI models faster, process larger datasets, and scale workloads efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an HPC Cluster?
&lt;/h2&gt;

&lt;p&gt;An HPC cluster is a group of interconnected servers working together as a single powerful computing environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  These clusters usually contain:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Multiple compute nodes
&lt;/li&gt;
&lt;li&gt;High-core-count CPUs
&lt;/li&gt;
&lt;li&gt;Powerful GPUs
&lt;/li&gt;
&lt;li&gt;High-speed networking
&lt;/li&gt;
&lt;li&gt;Parallel storage systems
&lt;/li&gt;
&lt;li&gt;Job scheduling software like Slurm
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of relying on a single machine, workloads are distributed across many systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI and ML Need HPC
&lt;/h2&gt;

&lt;p&gt;Modern AI training involves billions of calculations. Large datasets and deep neural networks demand huge computational resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Without HPC infrastructure, organizations often face:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Slow training times
&lt;/li&gt;
&lt;li&gt;GPU bottlenecks
&lt;/li&gt;
&lt;li&gt;Memory limitations
&lt;/li&gt;
&lt;li&gt;Storage performance issues
&lt;/li&gt;
&lt;li&gt;Scaling challenges
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;HPC solves these problems by providing distributed computing and parallel execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Faster Model Training
&lt;/h2&gt;

&lt;p&gt;One of the biggest advantages of HPC is reduced training time.&lt;/p&gt;

&lt;p&gt;For example, training a deep learning model on a single GPU may take several days. Using an HPC cluster with multiple GPUs across several nodes can reduce this time dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Frameworks such as:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;PyTorch
&lt;/li&gt;
&lt;li&gt;TensorFlow
&lt;/li&gt;
&lt;li&gt;Horovod
&lt;/li&gt;
&lt;li&gt;DeepSpeed
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;can distribute training across many GPUs simultaneously.&lt;/p&gt;

&lt;p&gt;This allows data parallelism and model parallelism at scale.&lt;/p&gt;
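&lt;p&gt;To make data parallelism concrete, here is a minimal, framework-free sketch in plain Python. It is not how PyTorch, Horovod, or DeepSpeed implement it; each "worker" simply stands in for one GPU, and the helper names (&lt;code&gt;local_gradient&lt;/code&gt;, &lt;code&gt;all_reduce_mean&lt;/code&gt;, &lt;code&gt;data_parallel_step&lt;/code&gt;) are made up for illustration. Every worker computes a gradient on its own shard of the batch, the gradients are averaged, and all workers apply the same update.&lt;/p&gt;

```python
# Toy illustration of data parallelism: each "worker" stands in for one GPU.
# Model: y = w * x, loss = mean squared error. All names here are hypothetical.

def local_gradient(w, shard):
    """Gradient of the mean squared error over one worker's shard."""
    total = 0.0
    for x, y in shard:
        total += 2.0 * (w * x - y) * x
    return total / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)

def data_parallel_step(w, dataset, num_workers, lr):
    # Split the batch into equal shards, one per worker.
    shards = [dataset[i::num_workers] for i in range(num_workers)]
    # In a real cluster these gradients would be computed at the same time.
    grads = [local_gradient(w, shard) for shard in shards]
    # Everyone applies the same averaged gradient, so the weights stay in sync.
    return w - lr * all_reduce_mean(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=2, lr=0.01)
print(round(w, 3))
```

&lt;p&gt;With equal shard sizes, the averaged gradient matches the full-batch gradient, so the learned weight approaches 2 just as it would on a single device.&lt;/p&gt;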

&lt;h2&gt;
  
  
  Efficient GPU Utilization
&lt;/h2&gt;

&lt;p&gt;GPUs are expensive resources, and HPC clusters help keep them busy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schedulers like Slurm can:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Allocate GPUs dynamically
&lt;/li&gt;
&lt;li&gt;Queue workloads efficiently
&lt;/li&gt;
&lt;li&gt;Prevent resource conflicts
&lt;/li&gt;
&lt;li&gt;Improve overall cluster utilization
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensures that GPUs remain productive instead of sitting idle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalability for Large Datasets
&lt;/h2&gt;

&lt;p&gt;AI models continue to grow in size. Datasets now reach terabytes or even petabytes.&lt;/p&gt;

&lt;h3&gt;
  
  
  HPC clusters provide scalable storage systems such as:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Lustre
&lt;/li&gt;
&lt;li&gt;BeeGFS
&lt;/li&gt;
&lt;li&gt;GPFS
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These parallel file systems allow high-speed data access from multiple nodes at the same time.&lt;/p&gt;

&lt;p&gt;As a result, training pipelines become faster and more reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed Training Made Easier
&lt;/h2&gt;

&lt;p&gt;Modern AI frameworks are designed to work well with HPC environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using technologies like:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;NCCL
&lt;/li&gt;
&lt;li&gt;MPI
&lt;/li&gt;
&lt;li&gt;RDMA
&lt;/li&gt;
&lt;li&gt;Omni-Path or InfiniBand networking
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;clusters can achieve low latency communication between GPUs and compute nodes.&lt;/p&gt;

&lt;p&gt;This becomes critical when training large transformer models or running multi-GPU workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Resource Sharing
&lt;/h2&gt;

&lt;p&gt;HPC clusters are ideal for universities, research labs, and enterprises where many users need access to computing resources.&lt;/p&gt;

&lt;p&gt;Instead of every team purchasing separate hardware, a centralized HPC environment allows shared access to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPUs
&lt;/li&gt;
&lt;li&gt;CPUs
&lt;/li&gt;
&lt;li&gt;Memory
&lt;/li&gt;
&lt;li&gt;Storage
&lt;/li&gt;
&lt;li&gt;Software environments
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reduces cost and improves operational efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Use Cases That Benefit from HPC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HPC clusters are widely used for:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Large Language Models
&lt;/li&gt;
&lt;li&gt;Computer Vision
&lt;/li&gt;
&lt;li&gt;Medical Imaging
&lt;/li&gt;
&lt;li&gt;Weather Prediction
&lt;/li&gt;
&lt;li&gt;Drug Discovery
&lt;/li&gt;
&lt;li&gt;Financial Modeling
&lt;/li&gt;
&lt;li&gt;Autonomous Vehicle Research
&lt;/li&gt;
&lt;li&gt;Scientific Simulations
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many of these workloads are impossible to run efficiently on a single machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges to Consider
&lt;/h2&gt;

&lt;p&gt;Although HPC offers major advantages, there are still challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Infrastructure cost
&lt;/li&gt;
&lt;li&gt;Power and cooling requirements
&lt;/li&gt;
&lt;li&gt;GPU availability
&lt;/li&gt;
&lt;li&gt;Network complexity
&lt;/li&gt;
&lt;li&gt;Cluster management
&lt;/li&gt;
&lt;li&gt;Software compatibility
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, the long-term performance gains usually outweigh the initial setup effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI and Machine Learning workloads are becoming increasingly demanding. Traditional systems are often not enough to handle modern training requirements.&lt;/p&gt;

&lt;p&gt;HPC clusters provide the computing power, scalability, and efficiency needed for advanced AI development.&lt;/p&gt;

&lt;p&gt;Whether you are training deep learning models, processing massive datasets, or running distributed workloads, HPC can significantly accelerate your AI journey.&lt;/p&gt;

&lt;p&gt;As AI continues to evolve, HPC infrastructure will become even more important for research and innovation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>highperformancecomputing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>NFS vs Parallel File Systems in HPC: How to Choose the Right Storage Architecture</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Thu, 07 May 2026 21:49:09 +0000</pubDate>
      <link>https://dev.to/zubairakbar/nfs-vs-parallel-file-systems-in-hpc-how-to-choose-the-right-storage-architecture-4dfn</link>
      <guid>https://dev.to/zubairakbar/nfs-vs-parallel-file-systems-in-hpc-how-to-choose-the-right-storage-architecture-4dfn</guid>
      <description>&lt;p&gt;When building or expanding an HPC cluster, one of the biggest architectural decisions is storage design. Many small and mid-sized clusters start with NFS because it is simple, reliable, and easy to manage. But as workloads grow, storage often becomes the hidden bottleneck.&lt;/p&gt;

&lt;p&gt;So the real question is:&lt;/p&gt;

&lt;p&gt;When is NFS enough, and when does an HPC cluster actually require a parallel file system like Lustre, BeeGFS, or GPFS?&lt;/p&gt;

&lt;p&gt;This article breaks down the practical factors that help HPC admins make that decision.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  NFS (Network File System)
&lt;/h3&gt;

&lt;p&gt;NFS is a centralized file-sharing system where compute nodes access data from a single storage server.&lt;/p&gt;

&lt;p&gt;Why admins love it&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to configure&lt;/li&gt;
&lt;li&gt;Minimal infrastructure&lt;/li&gt;
&lt;li&gt;Simple backups&lt;/li&gt;
&lt;li&gt;Lower operational overhead&lt;/li&gt;
&lt;li&gt;Great for small clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common HPC usage&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Home directories&lt;/li&gt;
&lt;li&gt;Software repositories&lt;/li&gt;
&lt;li&gt;Small research workloads&lt;/li&gt;
&lt;li&gt;Shared scripts and configuration files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel File Systems
&lt;/h3&gt;

&lt;p&gt;A parallel file system distributes storage operations across multiple servers and disks simultaneously.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lustre&lt;/li&gt;
&lt;li&gt;BeeGFS&lt;/li&gt;
&lt;li&gt;IBM GPFS / Spectrum Scale&lt;/li&gt;
&lt;li&gt;WekaFS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why they exist&lt;/p&gt;

&lt;p&gt;They are designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive throughput&lt;/li&gt;
&lt;li&gt;High concurrency&lt;/li&gt;
&lt;li&gt;Thousands of simultaneous reads/writes&lt;/li&gt;
&lt;li&gt;Large-scale HPC and AI workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Decision: Workload, Not Cluster Size
&lt;/h2&gt;

&lt;p&gt;One of the biggest misconceptions is:&lt;/p&gt;

&lt;p&gt;“Large cluster = parallel file system.”&lt;/p&gt;

&lt;p&gt;Not always.&lt;/p&gt;

&lt;p&gt;A 500-node cluster running lightweight CPU simulations may work perfectly fine with NFS.&lt;/p&gt;

&lt;p&gt;Meanwhile, a 20-node GPU AI cluster can completely overwhelm NFS in days.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decision depends more on:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;I/O behavior&lt;/li&gt;
&lt;li&gt;Data size&lt;/li&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;li&gt;Metadata pressure&lt;/li&gt;
&lt;li&gt;Performance expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Factors That Decide Between NFS and Parallel Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Number of Concurrent Jobs
&lt;/h3&gt;

&lt;p&gt;This is usually the first warning sign.&lt;/p&gt;

&lt;p&gt;NFS works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Few jobs access storage simultaneously&lt;/li&gt;
&lt;li&gt;Workloads are mostly compute-heavy&lt;/li&gt;
&lt;li&gt;Files are read occasionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problems start when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of jobs hit storage together&lt;/li&gt;
&lt;li&gt;Many users submit jobs simultaneously&lt;/li&gt;
&lt;li&gt;Applications continuously read/write checkpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Symptoms&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs stuck in I/O wait&lt;/li&gt;
&lt;li&gt;Slow application startup&lt;/li&gt;
&lt;li&gt;Hanging MPI jobs&lt;/li&gt;
&lt;li&gt;High NFS server load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your storage server becomes the cluster bottleneck, parallel storage should be considered.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  2. I/O Pattern of Applications
&lt;/h3&gt;

&lt;p&gt;Different applications stress storage differently.&lt;/p&gt;

&lt;p&gt;NFS handles well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential reads&lt;/li&gt;
&lt;li&gt;Small user datasets&lt;/li&gt;
&lt;li&gt;Software sharing&lt;/li&gt;
&lt;li&gt;Log files&lt;/li&gt;
&lt;li&gt;Light checkpointing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems are better for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large checkpoint files&lt;/li&gt;
&lt;li&gt;Frequent writes&lt;/li&gt;
&lt;li&gt;Multi-node parallel reads&lt;/li&gt;
&lt;li&gt;AI training datasets&lt;/li&gt;
&lt;li&gt;CFD and FEM simulations&lt;/li&gt;
&lt;li&gt;Genomics pipelines&lt;/li&gt;
&lt;li&gt;High-throughput workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example&lt;/p&gt;

&lt;p&gt;A simulation writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 GB every hour → NFS is usually fine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A deep learning job where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;32 GPUs constantly read millions of small images → NFS may collapse quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Metadata Operations
&lt;/h3&gt;

&lt;p&gt;This is one of the most ignored storage bottlenecks in HPC.&lt;/p&gt;

&lt;p&gt;Metadata operations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening files&lt;/li&gt;
&lt;li&gt;Closing files&lt;/li&gt;
&lt;li&gt;Listing directories&lt;/li&gt;
&lt;li&gt;Creating small files&lt;/li&gt;
&lt;li&gt;File existence checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI and genomics workloads often generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millions of tiny files&lt;/li&gt;
&lt;li&gt;Heavy directory scans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NFS struggles badly under metadata storms because a single server handles everything.&lt;/p&gt;

&lt;p&gt;Parallel file systems can spread metadata handling across multiple servers, for example by running several metadata servers in Lustre or BeeGFS.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Storage Throughput Requirements
&lt;/h3&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;p&gt;How much aggregate bandwidth does the cluster need?&lt;/p&gt;

&lt;p&gt;Example&lt;/p&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50 nodes each require 500 MB/s&lt;/li&gt;
&lt;li&gt;Total required throughput = 25 GB/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single NFS server is unlikely to sustain this consistently.&lt;/p&gt;

&lt;p&gt;Parallel storage is specifically designed for aggregate throughput scaling.&lt;/p&gt;
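&lt;p&gt;The arithmetic above is easy to script when sizing a cluster. The sketch below is only illustrative: the function name is made up, 1 GB is taken as 1000 MB, and the 100 Gb/s link speed for a single storage server is a hypothetical figure that ignores protocol overhead.&lt;/p&gt;

```python
def aggregate_throughput_gb_s(nodes, per_node_mb_s):
    """Total bandwidth in GB/s, using 1 GB = 1000 MB."""
    return nodes * per_node_mb_s / 1000

required = aggregate_throughput_gb_s(nodes=50, per_node_mb_s=500)
print(required)  # 25.0

# Hypothetical figure: one storage server behind a 100 Gb/s link,
# ignoring protocol overhead (100 gigabits/s divided by 8 bits per byte).
single_server_gb_s = 100 / 8

# Ceiling division: how many such servers the link speed alone would demand.
servers_needed = -(-required // single_server_gb_s)
print(int(servers_needed))  # 2
```

&lt;p&gt;Even under these generous assumptions, the network link alone rules out a single server, before disk and CPU limits are considered.&lt;/p&gt;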

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  5. GPU Workloads
&lt;/h3&gt;

&lt;p&gt;GPU clusters expose storage weaknesses extremely fast.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because GPUs consume data far faster than CPUs, they sit idle whenever storage cannot keep up.&lt;/p&gt;

&lt;p&gt;Common signs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization drops&lt;/li&gt;
&lt;li&gt;Data loader bottlenecks&lt;/li&gt;
&lt;li&gt;Training stalls&lt;/li&gt;
&lt;li&gt;NCCL timeouts when ranks stall waiting on I/O&lt;/li&gt;
&lt;li&gt;Slow checkpoint saves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For modern AI clusters, storage throughput becomes just as important as GPU performance.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Checkpointing Frequency
&lt;/h3&gt;

&lt;p&gt;Large HPC jobs periodically save state to disk.&lt;/p&gt;

&lt;p&gt;This is called checkpointing.&lt;/p&gt;

&lt;p&gt;NFS struggles when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of jobs checkpoint together&lt;/li&gt;
&lt;li&gt;Checkpoint files are huge&lt;/li&gt;
&lt;li&gt;Writes occur frequently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I/O spikes&lt;/li&gt;
&lt;li&gt;Server saturation&lt;/li&gt;
&lt;li&gt;Job slowdowns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems distribute write operations and handle burst traffic much better.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Scalability Expectations
&lt;/h3&gt;

&lt;p&gt;Think beyond today.&lt;/p&gt;

&lt;p&gt;NFS is usually enough for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Labs&lt;/li&gt;
&lt;li&gt;University research groups&lt;/li&gt;
&lt;li&gt;Small clusters&lt;/li&gt;
&lt;li&gt;Development environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel storage becomes attractive when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster growth is expected&lt;/li&gt;
&lt;li&gt;More users are added regularly&lt;/li&gt;
&lt;li&gt;GPU adoption increases&lt;/li&gt;
&lt;li&gt;Storage demand grows every quarter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Migrating later is possible, but painful.&lt;/p&gt;

&lt;p&gt;Planning early saves operational headaches.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  8. High Availability Requirements
&lt;/h3&gt;

&lt;p&gt;With NFS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One storage server often becomes a single point of failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that server goes down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs fail&lt;/li&gt;
&lt;li&gt;Mounts freeze&lt;/li&gt;
&lt;li&gt;Users lose access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parallel file systems typically support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redundant metadata servers&lt;/li&gt;
&lt;li&gt;Distributed storage targets&lt;/li&gt;
&lt;li&gt;Better failover models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters heavily in production HPC environments.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  When NFS Is Completely Fine
&lt;/h2&gt;

&lt;p&gt;NFS is still a perfectly valid HPC solution when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster size is small or medium&lt;/li&gt;
&lt;li&gt;Workloads are CPU-heavy&lt;/li&gt;
&lt;li&gt;I/O demand is modest&lt;/li&gt;
&lt;li&gt;User count is limited&lt;/li&gt;
&lt;li&gt;Budgets are constrained&lt;/li&gt;
&lt;li&gt;Simulations are compute-bound&lt;/li&gt;
&lt;li&gt;Storage traffic is predictable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many successful HPC environments run on NFS for years without major issues.&lt;/p&gt;

&lt;p&gt;Do not deploy complex parallel storage just because it sounds “enterprise.”&lt;/p&gt;

&lt;p&gt;Operational simplicity matters.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  When a Parallel File System Becomes Necessary
&lt;/h2&gt;

&lt;p&gt;You should seriously evaluate parallel storage if you observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High I/O wait times&lt;/li&gt;
&lt;li&gt;Saturated NFS server CPU/network&lt;/li&gt;
&lt;li&gt;GPU starvation&lt;/li&gt;
&lt;li&gt;Slow checkpointing&lt;/li&gt;
&lt;li&gt;Metadata bottlenecks&lt;/li&gt;
&lt;li&gt;Thousands of simultaneous file operations&lt;/li&gt;
&lt;li&gt;Multi-GB/s throughput demand&lt;/li&gt;
&lt;li&gt;Frequent user complaints about storage slowness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, storage is no longer just infrastructure.&lt;/p&gt;

&lt;p&gt;It becomes part of application performance.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Rule of Thumb
&lt;/h2&gt;

&lt;p&gt;Stay with NFS if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage is not your bottleneck&lt;/li&gt;
&lt;li&gt;Applications are compute-heavy&lt;/li&gt;
&lt;li&gt;Simplicity is more valuable than scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Move to parallel storage if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage limits job performance&lt;/li&gt;
&lt;li&gt;GPU utilization suffers&lt;/li&gt;
&lt;li&gt;I/O scales faster than compute&lt;/li&gt;
&lt;li&gt;Metadata load becomes extreme&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;There is no universal answer in HPC storage architecture.&lt;/p&gt;

&lt;p&gt;The best storage system is not the most advanced one.&lt;/p&gt;

&lt;p&gt;It is the one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matches workload behavior&lt;/li&gt;
&lt;li&gt;Scales with demand&lt;/li&gt;
&lt;li&gt;Stays operationally manageable&lt;/li&gt;
&lt;li&gt;Delivers consistent performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many clusters, NFS remains the right choice.&lt;/p&gt;

&lt;p&gt;But once storage starts limiting compute performance, a parallel file system stops being optional and becomes necessary infrastructure.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hpc</category>
      <category>filesystems</category>
      <category>networking</category>
    </item>
    <item>
      <title>How Modules Work in HPC</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Wed, 06 May 2026 15:31:47 +0000</pubDate>
      <link>https://dev.to/zubairakbar/how-modules-work-in-hpc-5be9</link>
      <guid>https://dev.to/zubairakbar/how-modules-work-in-hpc-5be9</guid>
      <description>&lt;p&gt;If you have ever logged into an HPC cluster and typed something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module load gcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…you have already used one of the most important tools in HPC environments: the module system, which on many clusters is provided by &lt;strong&gt;Lmod&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But what’s actually happening behind the scenes? And why do we even need modules in the first place?&lt;/p&gt;

&lt;p&gt;Let’s break it down in a simple, practical way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Too Many Software Versions
&lt;/h2&gt;

&lt;p&gt;HPC systems are shared by many users, and different projects often need different versions of the same software.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user needs &lt;strong&gt;Python 3.8&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Another needs &lt;strong&gt;Python 3.11&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Someone else depends on a specific &lt;strong&gt;GCC compiler version&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Installing everything globally would create conflicts and chaos.&lt;/p&gt;

&lt;p&gt;So instead of forcing one version on everyone, HPC systems use &lt;strong&gt;environment modules&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Lmod Actually Does
&lt;/h2&gt;

&lt;p&gt;Lmod is a system that dynamically modifies your shell environment so you can switch between software versions easily.&lt;/p&gt;

&lt;p&gt;When you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module load python/3.11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lmod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates your &lt;code&gt;PATH&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sets environment variables like &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensures dependencies are correctly configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In simple terms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It prepares your environment so the right software works correctly.&lt;/strong&gt;&lt;/p&gt;
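&lt;p&gt;The effect on your environment can be imitated in a few lines of Python. This is only a sketch of the idea, not Lmod’s implementation: the &lt;code&gt;prepend_path&lt;/code&gt; helper here is a Python stand-in, the install prefix &lt;code&gt;/opt/python/3.11&lt;/code&gt; is hypothetical, and removing duplicate entries is a simplification chosen for this example.&lt;/p&gt;

```python
def prepend_path(env, name, directory):
    """Return a copy of env with directory placed first in a colon-separated variable."""
    updated = dict(env)
    current = updated.get(name, "")
    # Drop empty entries and any earlier copy of the same directory.
    parts = [p for p in current.split(":") if p and p != directory]
    updated[name] = ":".join([directory] + parts)
    return updated

env = {"PATH": "/usr/bin:/bin"}
# Hypothetical install prefix for the python/3.11 module.
env = prepend_path(env, "PATH", "/opt/python/3.11/bin")
env = prepend_path(env, "LD_LIBRARY_PATH", "/opt/python/3.11/lib")
print(env["PATH"])             # /opt/python/3.11/bin:/usr/bin:/bin
print(env["LD_LIBRARY_PATH"])  # /opt/python/3.11/lib
```

&lt;p&gt;Because the new directory comes first, the shell finds that version of the software before any system-wide one.&lt;/p&gt;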




&lt;h2&gt;
  
  
  Think of It Like This
&lt;/h2&gt;

&lt;p&gt;Imagine your environment as a workspace.&lt;/p&gt;

&lt;p&gt;Each module you load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds tools to your workspace&lt;/li&gt;
&lt;li&gt;Configures them correctly&lt;/li&gt;
&lt;li&gt;Avoids interfering with other tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without modules, you’d have to manually set everything yourself every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic Commands You’ll Use
&lt;/h2&gt;

&lt;h3&gt;
  
  
  List available modules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module avail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load a module
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module load gcc/12.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unload a module
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module unload gcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  See what’s currently loaded
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Swap versions easily
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module swap python/3.8 python/3.11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Are Modulefiles?
&lt;/h2&gt;

&lt;p&gt;Behind every module is a &lt;strong&gt;modulefile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is just a script (usually written in Lua for Lmod) that tells the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What paths to add&lt;/li&gt;
&lt;li&gt;What variables to set&lt;/li&gt;
&lt;li&gt;What dependencies to load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example idea:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;prepend_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/gcc/12.2/bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You don’t usually need to edit these, but it helps to know they exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Handling Dependencies Automatically
&lt;/h2&gt;

&lt;p&gt;One of the biggest advantages of Lmod is dependency management.&lt;/p&gt;

&lt;p&gt;If you load something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module load openmpi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lmod can automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load the correct compiler&lt;/li&gt;
&lt;li&gt;Avoid incompatible versions&lt;/li&gt;
&lt;li&gt;Prevent conflicts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This saves a lot of debugging time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Mixing incompatible modules
&lt;/h3&gt;

&lt;p&gt;Loading different compilers and MPI stacks together can break things.&lt;/p&gt;

&lt;p&gt;Stick to consistent toolchains.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Forgetting to load modules in job scripts
&lt;/h3&gt;

&lt;p&gt;What works in your shell might fail in Slurm if modules aren’t loaded.&lt;/p&gt;

&lt;p&gt;Always include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module load &amp;lt;required-modules&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Dirty environments
&lt;/h3&gt;

&lt;p&gt;If things behave strangely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;module purge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



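&lt;p&gt;This unloads all currently loaded modules (except any your site has marked as sticky), so you can reload only what you need from a clean slate.&lt;/p&gt;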




&lt;h2&gt;
  
  
  Why Lmod Matters in HPC
&lt;/h2&gt;

&lt;p&gt;Lmod makes HPC usable at scale by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoiding software conflicts&lt;/li&gt;
&lt;li&gt;Supporting multiple users and workflows&lt;/li&gt;
&lt;li&gt;Simplifying environment setup&lt;/li&gt;
&lt;li&gt;Making jobs reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without it, managing software on clusters would be painful and error-prone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;You don’t need to understand every detail of Lmod to use it effectively.&lt;/p&gt;

&lt;p&gt;Just remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modules control your environment.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your environment controls your results.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you get comfortable with modules, debugging HPC jobs becomes much easier.&lt;/p&gt;

</description>
      <category>hpc</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>modules</category>
    </item>
    <item>
      <title>Inside Job Logs: What to Look For When Things Break</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Mon, 04 May 2026 22:37:22 +0000</pubDate>
      <link>https://dev.to/zubairakbar/inside-job-logs-what-to-look-for-when-things-break-5gnk</link>
      <guid>https://dev.to/zubairakbar/inside-job-logs-what-to-look-for-when-things-break-5gnk</guid>
      <description>&lt;p&gt;When a job fails on an HPC cluster, your first instinct might be to rerun it and hope for a different outcome. That rarely works. The real answers are almost always sitting quietly in your job logs.&lt;/p&gt;

&lt;p&gt;Understanding how to read those logs effectively can save hours of guesswork and help you fix issues faster and more confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With the Basics: Exit Codes
&lt;/h2&gt;

&lt;p&gt;Every job finishes with an exit code. This is the simplest signal of what happened.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 means success
&lt;/li&gt;
&lt;li&gt;Non-zero values indicate failure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Slurm, you will often see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ExitCode=1:0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first number is the job’s exit status, and the second is the signal that terminated it, if any. A non-zero signal usually points to something more abrupt, like a kill or crash.&lt;/p&gt;
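&lt;p&gt;If you collect these values for many jobs, a small helper can turn them into readable summaries. The sketch below assumes the value has already been extracted as a plain &lt;code&gt;status:signal&lt;/code&gt; string; the function names are made up for illustration.&lt;/p&gt;

```python
def parse_exit_code(field):
    """Split a Slurm-style 'status:signal' string into two integers."""
    status_text, signal_text = field.split(":")
    return int(status_text), int(signal_text)

def describe(field):
    status, signal = parse_exit_code(field)
    # A signal takes priority: the process did not exit on its own.
    if signal != 0:
        return "terminated by signal %d" % signal
    if status != 0:
        return "exited with status %d" % status
    return "completed successfully"

print(describe("1:0"))  # exited with status 1
print(describe("0:9"))  # terminated by signal 9
print(describe("0:0"))  # completed successfully
```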

&lt;h2&gt;
  
  
  Check Standard Output and Error Files
&lt;/h2&gt;

&lt;p&gt;Slurm writes logs to files like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slurm-&amp;lt;jobid&amp;gt;.out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or custom paths defined in your job script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --output=job.out #SBATCH --error=job.err&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These files are your primary source of truth.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stdout shows normal program output
&lt;/li&gt;
&lt;li&gt;stderr shows warnings, errors, and crashes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always read stderr first when debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Look for the First Error, Not the Last
&lt;/h2&gt;

&lt;p&gt;A common mistake is focusing on the last line of the log. In reality, the root cause often appears much earlier.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File not found: input.dat Segmentation fault (core dumped)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The segmentation fault is just a consequence. The missing file is the real issue.&lt;/p&gt;
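&lt;p&gt;For long logs, a short script can point you at the earliest suspicious line instead of the last one. This is a rough sketch: the list of markers in &lt;code&gt;ERROR_PATTERN&lt;/code&gt; is a hypothetical starting point and should be extended with the messages your own applications emit.&lt;/p&gt;

```python
import re

# Hypothetical marker list; extend it with messages your applications emit.
ERROR_PATTERN = re.compile(r"error|not found|fault|killed|traceback", re.IGNORECASE)

def first_error(lines):
    """Return (line_number, text) of the earliest suspicious line, or None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            return number, line.strip()
    return None

log = [
    "Starting solver",
    "File not found: input.dat",
    "Segmentation fault (core dumped)",
]
print(first_error(log))  # (2, 'File not found: input.dat')
```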

&lt;h2&gt;
  
  
  Memory Issues: Subtle but Common
&lt;/h2&gt;

&lt;p&gt;Memory problems show up in different ways depending on how the system enforces limits.&lt;/p&gt;

&lt;p&gt;Typical signs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out Of Memory&lt;/li&gt;
&lt;li&gt;Killed&lt;/li&gt;
&lt;li&gt;oom-kill event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Slurm, you might also see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slurmstepd: error: Detected 1 oom-kill event(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this happens, your job likely exceeded its allocated memory. Increase &lt;code&gt;--mem&lt;/code&gt; or optimize memory usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node-Level Failures vs Application Errors
&lt;/h2&gt;

&lt;p&gt;Not every failure is your fault.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Errors
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Segmentation faults
&lt;/li&gt;
&lt;li&gt;Python tracebacks
&lt;/li&gt;
&lt;li&gt;Missing libraries
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These point to issues in your code or environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  System or Node Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Block device required
&lt;/li&gt;
&lt;li&gt;I/O error
&lt;/li&gt;
&lt;li&gt;Node unreachable messages
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These suggest problems with the compute node, filesystem, or scheduler.&lt;/p&gt;

&lt;p&gt;If multiple jobs fail on the same node, it’s a strong signal of a node issue.&lt;/p&gt;
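
&lt;p&gt;If you keep a record of which node each failed job ran on, spotting a repeat offender takes a few lines. This is a minimal sketch; the job IDs and node names are invented for the example.&lt;/p&gt;

```python
from collections import Counter


def suspect_nodes(failures, threshold=2):
    """Return nodes that appear in at least `threshold` failed jobs.

    failures: iterable of (job_id, node_name) pairs for failed jobs.
    """
    counts = Counter(node for _job_id, node in failures)
    return sorted(node for node, count in counts.items() if count >= threshold)


# Invented sample data.
failed = [(101, "node07"), (102, "node12"), (103, "node07"), (104, "node07")]
print(suspect_nodes(failed))  # ['node07']
```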

&lt;h2&gt;
  
  
  Environment and Dependency Problems
&lt;/h2&gt;

&lt;p&gt;A job might fail simply because something isn’t loaded.&lt;/p&gt;

&lt;p&gt;Look for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;command not found
module: not found
libXYZ.so: cannot open shared object file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These errors usually mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing modules
&lt;/li&gt;
&lt;li&gt;Incorrect environment setup
&lt;/li&gt;
&lt;li&gt;Wrong software versions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Double-check your module loads and environment variables.&lt;/p&gt;
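
&lt;p&gt;A small pre-flight check at the start of a job can catch a missing command before the real work begins. The sketch below uses Python’s standard &lt;code&gt;shutil.which&lt;/code&gt;; the list of required commands is a placeholder for whatever your job depends on.&lt;/p&gt;

```python
import shutil


def missing_commands(commands):
    """Return the commands from `commands` that are not on PATH."""
    return [name for name in commands if shutil.which(name) is None]


required = ["python3", "mpirun"]  # placeholder list, adjust for your job
missing = missing_commands(required)
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All required commands found")
```

&lt;p&gt;This only checks executables. Missing shared libraries still show up at load time, so the error messages above remain worth watching for.&lt;/p&gt;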

&lt;h2&gt;
  
  
  MPI and Multi-Node Clues
&lt;/h2&gt;

&lt;p&gt;For parallel jobs, logs can get noisy. Focus on patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank-specific failures
&lt;/li&gt;
&lt;li&gt;Communication errors
&lt;/li&gt;
&lt;li&gt;Timeouts
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MPI_ABORT was invoked
NCCL error
connection timed out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These often point to network issues, misconfiguration, or mismatched libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timing and Resource Clues
&lt;/h2&gt;

&lt;p&gt;Sometimes the issue isn’t a crash, but inefficiency or limits.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs stopping exactly at walltime
&lt;/li&gt;
&lt;li&gt;Slow startup or long idle times
&lt;/li&gt;
&lt;li&gt;Uneven resource usage
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slurm accounting tools like sacct and seff can complement logs and give a clearer picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a Debugging Habit
&lt;/h2&gt;

&lt;p&gt;Instead of reacting randomly to failures, follow a consistent approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check exit code
&lt;/li&gt;
&lt;li&gt;Read stderr from top to bottom
&lt;/li&gt;
&lt;li&gt;Identify the first real error
&lt;/li&gt;
&lt;li&gt;Correlate with resource usage and job settings
&lt;/li&gt;
&lt;li&gt;Verify environment and dependencies
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over time, patterns become familiar, and debugging gets faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Logs are not just noise. They are structured clues about what went wrong and why.&lt;/p&gt;

&lt;p&gt;The more time you spend understanding them, the less time you waste guessing. In HPC environments, that difference matters.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hpc</category>
      <category>slurm</category>
      <category>help</category>
    </item>
    <item>
      <title>Shared vs Distributed Memory – Why It Matters More Than You Think</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sun, 03 May 2026 22:00:20 +0000</pubDate>
      <link>https://dev.to/zubairakbar/shared-vs-distributed-memory-why-it-matters-more-than-you-think-4al3</link>
      <guid>https://dev.to/zubairakbar/shared-vs-distributed-memory-why-it-matters-more-than-you-think-4al3</guid>
      <description>&lt;p&gt;When people start working with high performance computing or parallel systems, “memory” often sounds like a background detail. It’s not. The way memory is structured can completely change how your applications behave, scale, and even fail.&lt;/p&gt;

&lt;p&gt;Let’s break it down in a practical way.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Shared Memory?
&lt;/h2&gt;

&lt;p&gt;In a shared memory system, all processors access the same memory space.&lt;/p&gt;

&lt;p&gt;Think of it like multiple people working on a single Google Doc. Everyone sees the same data, and changes are immediately visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key traits:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One global memory space&lt;/li&gt;
&lt;li&gt;Fast communication between threads&lt;/li&gt;
&lt;li&gt;Easier to program (generally)&lt;/li&gt;
&lt;li&gt;Requires synchronization (locks, semaphores)&lt;/li&gt;
&lt;/ul&gt;
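
&lt;p&gt;To make the synchronization point concrete, here is a small sketch using Python threads rather than OpenMP. Several threads update one shared counter, and a lock makes each update safe. The function name is made up for this example.&lt;/p&gt;

```python
import threading


def count_with_lock(n_threads, increments):
    """Have several threads increment one shared counter under a lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            # Only one thread at a time may update the shared value.
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


print(count_with_lock(4, 1000))  # 4000
```

&lt;p&gt;Every thread sees the same variable, which is convenient, but the lock is also where threads queue up behind each other.&lt;/p&gt;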

&lt;h3&gt;
  
  
  Where you see it:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi core CPUs&lt;/li&gt;
&lt;li&gt;OpenMP based applications&lt;/li&gt;
&lt;li&gt;Single node parallel jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The catch:
&lt;/h3&gt;

&lt;p&gt;Shared memory doesn’t scale well forever. As you add more cores, contention increases. Memory bandwidth becomes a bottleneck, and performance starts to drop.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Distributed Memory?
&lt;/h2&gt;

&lt;p&gt;In distributed memory systems, each processor (or node) has its own private memory.&lt;/p&gt;

&lt;p&gt;Now imagine each person has their own document, and they email updates to each other. Communication is explicit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key traits:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Separate memory per node&lt;/li&gt;
&lt;li&gt;Communication via message passing&lt;/li&gt;
&lt;li&gt;More control, but more complexity&lt;/li&gt;
&lt;li&gt;Scales much better across machines&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where you see it:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;HPC clusters&lt;/li&gt;
&lt;li&gt;MPI based applications&lt;/li&gt;
&lt;li&gt;Multi node Slurm jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The catch:
&lt;/h3&gt;

&lt;p&gt;You have to manage communication yourself. Poor data exchange design can kill performance.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared vs Distributed: The Real Difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Access
&lt;/h3&gt;

&lt;p&gt;In shared memory, everything lives in one global space. Any thread can read or modify data directly.&lt;/p&gt;

&lt;p&gt;In distributed memory, each node has its own local memory. If you need data from another node, you have to explicitly request it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication Style
&lt;/h3&gt;

&lt;p&gt;Shared memory systems rely on implicit communication. Threads just read and write to the same variables.&lt;/p&gt;

&lt;p&gt;Distributed systems are explicit. You send and receive messages, often using MPI. Nothing is shared unless you make it shared.&lt;/p&gt;
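
&lt;p&gt;The idea can be simulated in plain Python without MPI. In this sketch, each &lt;code&gt;Node&lt;/code&gt; (a made-up class, not an MPI type) has private memory, and data only moves when it is explicitly sent and received as a copy.&lt;/p&gt;

```python
import copy


class Node:
    """A toy stand-in for a node with its own private memory."""

    def __init__(self, name):
        self.name = name
        self.memory = {}
        self.inbox = []

    def send(self, other, key):
        # The receiver gets a copy, never a reference to our memory.
        other.inbox.append((self.name, key, copy.deepcopy(self.memory[key])))

    def receive(self):
        sender, key, value = self.inbox.pop(0)
        self.memory[key] = value
        return sender


a, b = Node("a"), Node("b")
a.memory["x"] = [1, 2]
a.send(b, "x")
a.memory["x"].append(3)  # a later change on the sender...
b.receive()
print(b.memory["x"])     # ...is not visible to the receiver: [1, 2]
```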

&lt;h3&gt;
  
  
  Performance Behavior
&lt;/h3&gt;

&lt;p&gt;Shared memory is extremely fast at small scale since there’s no network involved.&lt;/p&gt;

&lt;p&gt;Distributed memory shines when scaling out. You can add more nodes, but now you pay the cost of network communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity
&lt;/h3&gt;

&lt;p&gt;Shared memory is easier to get started with. You can parallelize loops and see quick results.&lt;/p&gt;

&lt;p&gt;Distributed memory requires planning. You need to think about data distribution, communication patterns, and synchronization from the beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottlenecks
&lt;/h3&gt;

&lt;p&gt;Shared memory systems struggle with contention. Too many threads fighting over the same memory slows everything down.&lt;/p&gt;

&lt;p&gt;Distributed systems hit network limits. Latency and bandwidth become the main constraints as you scale.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Actually Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Your Code Design Changes
&lt;/h3&gt;

&lt;p&gt;A shared memory program might rely on simple loops with parallel directives.&lt;/p&gt;

&lt;p&gt;A distributed memory program forces you to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data partitioning&lt;/li&gt;
&lt;li&gt;Communication patterns&lt;/li&gt;
&lt;li&gt;Synchronization across nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same problem, completely different mindset.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scaling Isn’t Automatic
&lt;/h3&gt;

&lt;p&gt;A program that runs perfectly on 8 cores might fall apart on 100 nodes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared memory hits hardware limits&lt;/li&gt;
&lt;li&gt;Distributed memory introduces network overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding the model helps you predict scaling behavior instead of guessing.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Debugging Becomes a Different Game
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Shared memory bugs → race conditions, deadlocks&lt;/li&gt;
&lt;li&gt;Distributed memory bugs → hangs, mismatched sends/receives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are painful, just in different ways.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Hybrid is the Reality
&lt;/h3&gt;

&lt;p&gt;Modern HPC systems don’t force you to choose one.&lt;/p&gt;

&lt;p&gt;Most real workloads use a hybrid model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MPI between nodes (distributed)&lt;/li&gt;
&lt;li&gt;OpenMP within a node (shared)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where performance tuning becomes interesting and tricky.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Analogy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Shared memory = One kitchen, many cooks&lt;/li&gt;
&lt;li&gt;Distributed memory = Many kitchens, coordinated recipes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One is easier to manage. The other scales better.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;If you’re working with HPC, cloud scaling, or even large data pipelines, memory architecture isn’t just a technical detail. It’s a design decision.&lt;/p&gt;

&lt;p&gt;Ignoring it leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Poor scaling&lt;/li&gt;
&lt;li&gt;Unpredictable performance&lt;/li&gt;
&lt;li&gt;Hard-to-debug systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding it gives you control.&lt;/p&gt;

&lt;p&gt;And in distributed systems, control is everything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>distributedsystems</category>
      <category>hpc</category>
      <category>networking</category>
    </item>
    <item>
      <title>How MPI Works Under the Hood (Without the Jargon)</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Fri, 01 May 2026 21:39:19 +0000</pubDate>
      <link>https://dev.to/zubairakbar/how-mpi-works-under-the-hood-without-the-jargon-44a8</link>
      <guid>https://dev.to/zubairakbar/how-mpi-works-under-the-hood-without-the-jargon-44a8</guid>
      <description>&lt;p&gt;If you have ever run a job on an HPC cluster, chances are you have used MPI without fully knowing what’s happening behind the scenes. And that’s completely normal. MPI often feels like a black box that just “makes parallel jobs work.”&lt;/p&gt;

&lt;p&gt;Let’s open that box a bit, without diving into heavy theory or academic jargon.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  The Basic Idea
&lt;/h2&gt;

&lt;p&gt;MPI (Message Passing Interface) is simply a way for multiple processes to talk to each other while running a program.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;p&gt;Instead of one program doing all the work, MPI lets you run many copies of the same program. Each copy handles a portion of the task and communicates with others when needed.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happens When You Run an MPI Job?
&lt;/h2&gt;

&lt;p&gt;When you launch an MPI job using something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;mpirun -np 4 ./my_app
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what’s going on under the hood:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multiple Processes Are Started
&lt;/h3&gt;

&lt;p&gt;MPI doesn’t create threads. It starts completely separate processes.&lt;/p&gt;

&lt;p&gt;Each process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has its own memory space&lt;/li&gt;
&lt;li&gt;Runs independently&lt;/li&gt;
&lt;li&gt;Gets a unique ID called a rank&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Each Process Knows Its Role
&lt;/h3&gt;

&lt;p&gt;Every MPI process gets a rank:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank 0 → usually the coordinator&lt;/li&gt;
&lt;li&gt;Rank 1, 2, 3… → workers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your code uses these ranks to decide who does what.&lt;/p&gt;
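
&lt;p&gt;A common way to use the rank is to work out which slice of the data each process owns. The sketch below assumes the rank and the total number of processes have already been obtained from MPI (in C, via &lt;code&gt;MPI_Comm_rank&lt;/code&gt; and &lt;code&gt;MPI_Comm_size&lt;/code&gt;). They are passed in as plain integers here so the example runs without MPI installed, and the function name is invented.&lt;/p&gt;

```python
def chunk_for_rank(n_items, rank, size):
    """Return the (start, stop) range of n_items owned by this rank.

    Items are split as evenly as possible; the first few ranks take
    one extra item when n_items does not divide evenly by size.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = (rank + 1) * base + min(rank + 1, extra)
    return start, stop


# Ten items shared between four processes.
for rank in range(4):
    print(rank, chunk_for_rank(10, rank, 4))
# 0 (0, 3)
# 1 (3, 6)
# 2 (6, 8)
# 3 (8, 10)
```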

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Communication Happens via Messages
&lt;/h3&gt;

&lt;p&gt;Processes don’t share memory. Instead, they send and receive messages.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process 0 sends data → Process 1 receives it&lt;/li&gt;
&lt;li&gt;Process 2 broadcasts something → everyone gets it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the core of MPI.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does “Sending a Message” Really Mean?
&lt;/h2&gt;

&lt;p&gt;When one process sends data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The data is copied into a buffer&lt;/li&gt;
&lt;li&gt;MPI hands it to the system (network or shared memory)&lt;/li&gt;
&lt;li&gt;It travels to the target process&lt;/li&gt;
&lt;li&gt;The receiving process copies it into its memory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If processes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the same node → shared memory is used&lt;/li&gt;
&lt;li&gt;On different nodes → network (like InfiniBand or Ethernet)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  How MPI Uses the Hardware
&lt;/h2&gt;

&lt;p&gt;MPI is smarter than it looks. It adapts based on where processes are running:&lt;/p&gt;

&lt;p&gt;Same Node&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses shared memory (fast)&lt;/li&gt;
&lt;li&gt;No real “network” involved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different Nodes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses high-speed interconnects&lt;/li&gt;
&lt;li&gt;Optimized protocols to reduce latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good MPI implementations automatically pick the best method.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronization (Keeping Everyone in Check)
&lt;/h2&gt;

&lt;p&gt;Sometimes processes need to wait for each other.&lt;/p&gt;

&lt;p&gt;MPI provides mechanisms like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Barriers → everyone pauses until all reach a point&lt;/li&gt;
&lt;li&gt;Collective operations → like broadcast, reduce&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures coordination across processes.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Mental Model
&lt;/h2&gt;

&lt;p&gt;Imagine a group project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each person (process) works on their part&lt;/li&gt;
&lt;li&gt;They occasionally send updates to others&lt;/li&gt;
&lt;li&gt;One person might collect results and combine everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MPI is just the system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assigns ranks&lt;/li&gt;
&lt;li&gt;Handles communication&lt;/li&gt;
&lt;li&gt;Keeps things in sync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Things Sometimes Go Wrong
&lt;/h2&gt;

&lt;p&gt;MPI issues often come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One process waiting for a message that never arrives&lt;/li&gt;
&lt;li&gt;Mismatched send/receive calls&lt;/li&gt;
&lt;li&gt;Network or node issues&lt;/li&gt;
&lt;li&gt;Poor workload distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because everything runs independently, small mistakes can cause hangs or failures.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MPI Is Still So Widely Used
&lt;/h2&gt;

&lt;p&gt;Despite newer technologies, MPI remains dominant in HPC because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It scales extremely well&lt;/li&gt;
&lt;li&gt;Works across thousands of nodes&lt;/li&gt;
&lt;li&gt;Gives precise control over communication&lt;/li&gt;
&lt;li&gt;Is highly optimized for performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;MPI isn’t magic. It’s just a well-designed system for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running multiple processes&lt;/li&gt;
&lt;li&gt;Passing messages between them&lt;/li&gt;
&lt;li&gt;Coordinating work efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you understand that, debugging and optimizing MPI jobs becomes much easier.&lt;/p&gt;

</description>
      <category>mpi</category>
      <category>hpc</category>
      <category>networking</category>
      <category>ai</category>
    </item>
    <item>
      <title>Bare Metal vs Virtual Machines vs Containers in HPC</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Thu, 30 Apr 2026 19:29:11 +0000</pubDate>
      <link>https://dev.to/zubairakbar/bare-metal-vs-virtual-machines-vs-containers-in-hpc-4ple</link>
      <guid>https://dev.to/zubairakbar/bare-metal-vs-virtual-machines-vs-containers-in-hpc-4ple</guid>
      <description>&lt;p&gt;High Performance Computing isn’t just about powerful CPUs and fast interconnects. The way workloads are deployed matters just as much. Whether you’re running simulations, AI training, or large-scale data processing, choosing between bare metal, virtual machines, and containers can directly impact performance, flexibility, and efficiency.&lt;/p&gt;

&lt;p&gt;Let’s break it down in a practical way.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bare Metal: Maximum Performance, Minimum Abstraction
&lt;/h3&gt;

&lt;p&gt;Bare metal means running workloads directly on physical hardware without any virtualization layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why HPC loves it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full access to CPU, memory, GPUs, and high-speed networks&lt;/li&gt;
&lt;li&gt;No virtualization overhead&lt;/li&gt;
&lt;li&gt;Best for tightly coupled MPI jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it shines:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large-scale simulations&lt;/li&gt;
&lt;li&gt;CFD, weather modeling&lt;/li&gt;
&lt;li&gt;Latency-sensitive workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harder to manage at scale&lt;/li&gt;
&lt;li&gt;Less flexible environment control&lt;/li&gt;
&lt;li&gt;Software conflicts can become painful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bare metal is still the gold standard in traditional HPC clusters, especially when every microsecond counts.&lt;/p&gt;




&lt;h3&gt;
  
  
  Virtual Machines: Isolation with Overhead
&lt;/h3&gt;

&lt;p&gt;Virtual Machines (VMs) add a hypervisor layer, allowing multiple OS instances on the same hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong isolation between workloads&lt;/li&gt;
&lt;li&gt;Easy to snapshot, clone, and migrate&lt;/li&gt;
&lt;li&gt;Good for multi-tenant environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where they fit in HPC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-based HPC setups&lt;/li&gt;
&lt;li&gt;Development and testing environments&lt;/li&gt;
&lt;li&gt;Workloads that don’t need ultra-low latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance overhead (CPU, I/O, networking)&lt;/li&gt;
&lt;li&gt;Limited access to specialized hardware (unless using passthrough)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VMs are more common in cloud HPC than in on-prem clusters, where performance loss is less acceptable.&lt;/p&gt;




&lt;h3&gt;
  
  
  Containers: Lightweight and Portable
&lt;/h3&gt;

&lt;p&gt;Containers package applications with their dependencies, running on the host OS without a full VM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re gaining popularity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Near bare-metal performance&lt;/li&gt;
&lt;li&gt;Easy reproducibility&lt;/li&gt;
&lt;li&gt;Portable across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Popular tools in HPC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker (less common in production HPC)&lt;/li&gt;
&lt;li&gt;Singularity / Apptainer (designed for HPC)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where they shine:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI/ML workloads&lt;/li&gt;
&lt;li&gt;Research reproducibility&lt;/li&gt;
&lt;li&gt;Complex software stacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared kernel (less isolation than VMs)&lt;/li&gt;
&lt;li&gt;Requires proper integration with schedulers like Slurm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers strike a strong balance between performance and flexibility, which is why they’re rapidly becoming standard in modern HPC environments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Choosing the Right Model for Your Use Case
&lt;/h3&gt;

&lt;p&gt;Instead of thinking about which one is “better,” it’s more useful to map each approach to real HPC scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go with bare metal when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re running tightly coupled MPI jobs across nodes&lt;/li&gt;
&lt;li&gt;Network latency and bandwidth are critical&lt;/li&gt;
&lt;li&gt;You need full GPU or accelerator performance&lt;/li&gt;
&lt;li&gt;You’re operating a traditional on-prem HPC cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is typical in scientific computing, engineering simulations, and large-scale physics workloads.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Use virtual machines when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re running HPC workloads in the cloud&lt;/li&gt;
&lt;li&gt;Multiple users or teams need strict isolation&lt;/li&gt;
&lt;li&gt;You want to spin up environments quickly for testing&lt;/li&gt;
&lt;li&gt;Performance is important, but not the absolute priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VMs make sense in hybrid HPC setups or when infrastructure flexibility matters more than squeezing out every bit of performance.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Choose containers when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need reproducible environments across clusters&lt;/li&gt;
&lt;li&gt;Your workloads depend on complex or conflicting libraries&lt;/li&gt;
&lt;li&gt;You’re running AI/ML pipelines or modern data workloads&lt;/li&gt;
&lt;li&gt;You want users to bring their own software stack easily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers are especially powerful in research environments where portability and consistency are critical.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Real-World Approach
&lt;/h3&gt;

&lt;p&gt;Most modern HPC environments don’t rely on just one model.&lt;/p&gt;

&lt;p&gt;A common pattern looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bare metal nodes for raw compute power&lt;/li&gt;
&lt;li&gt;Containers for application portability&lt;/li&gt;
&lt;li&gt;Virtual machines in cloud or hybrid layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach gives you the best of all worlds: performance, flexibility, and scalability.&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thought
&lt;/h3&gt;

&lt;p&gt;HPC is evolving beyond just hardware. The focus is shifting toward how efficiently workloads can be deployed and reproduced.&lt;/p&gt;

&lt;p&gt;Bare metal still dominates performance-critical workloads. Containers are redefining usability and portability. Virtual machines fill the gap where flexibility and isolation are needed.&lt;/p&gt;

&lt;p&gt;The right choice depends on what you’re optimizing for, not what’s trending.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hpc</category>
      <category>infrastructure</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What Actually Happens When You Run sbatch in Slurm</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Tue, 28 Apr 2026 20:40:47 +0000</pubDate>
      <link>https://dev.to/zubairakbar/what-actually-happens-when-you-run-sbatch-in-slurm-1ncj</link>
      <guid>https://dev.to/zubairakbar/what-actually-happens-when-you-run-sbatch-in-slurm-1ncj</guid>
      <description>&lt;p&gt;If you work with HPC clusters, you likely use sbatch every day. You submit a script and expect it to run.&lt;/p&gt;

&lt;p&gt;But that single command triggers a full workflow inside Slurm.&lt;/p&gt;

&lt;p&gt;Understanding this internal flow helps you debug issues faster, optimize job performance, and better understand how your cluster behaves.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Submitting the Job
&lt;/h2&gt;

&lt;p&gt;When you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sbatch job.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are not starting the job. You are submitting a request to Slurm.&lt;/p&gt;

&lt;h3&gt;
  
  
  The script includes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Resource requirements such as CPUs, memory, GPUs&lt;/li&gt;
&lt;li&gt;Job metadata like name and output paths&lt;/li&gt;
&lt;li&gt;The actual commands to execute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, Slurm simply accepts the job.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Communication with slurmctld
&lt;/h2&gt;

&lt;p&gt;The sbatch command sends the job to the Slurm controller daemon, slurmctld.&lt;/p&gt;

&lt;h3&gt;
  
  
  This daemon:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Assigns a Job ID&lt;/li&gt;
&lt;li&gt;Stores the job details&lt;/li&gt;
&lt;li&gt;Marks the job as PENDING&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing is running yet.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Job Enters the Queue
&lt;/h2&gt;

&lt;p&gt;The job is now placed in the scheduling queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slurm evaluates:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Job priority&lt;/li&gt;
&lt;li&gt;Fairshare usage&lt;/li&gt;
&lt;li&gt;Partition limits&lt;/li&gt;
&lt;li&gt;Resource availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This determines when your job will run.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Scheduling Decision
&lt;/h2&gt;

&lt;p&gt;The scheduler continuously checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free nodes&lt;/li&gt;
&lt;li&gt;Resource fragmentation&lt;/li&gt;
&lt;li&gt;Backfill opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your job fits available resources, it gets selected. Otherwise, it stays pending.&lt;/p&gt;
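
&lt;p&gt;The selection logic can be pictured with a toy model. This is not Slurm’s actual algorithm: real backfill also takes time limits into account so that smaller jobs do not delay higher-priority ones. The job names, priorities, and node counts below are invented.&lt;/p&gt;

```python
def pick_jobs(pending, free_nodes):
    """Toy scheduler pass: walk jobs by priority and start those that fit.

    pending: list of (name, priority, nodes_needed) tuples.
    Returns the names of the jobs that would start in this pass.
    """
    started = []
    by_priority = sorted(pending, key=lambda job: job[1], reverse=True)
    for name, _priority, needed in by_priority:
        if free_nodes >= needed:
            started.append(name)
            free_nodes -= needed
        # Jobs that do not fit simply stay pending in this toy model.
    return started


pending = [("big", 100, 4), ("medium", 50, 2), ("small", 10, 1)]
print(pick_jobs(pending, free_nodes=3))  # ['medium', 'small']
```

&lt;p&gt;Here the highest-priority job needs more nodes than are free, so it keeps waiting while two smaller jobs fill the gap.&lt;/p&gt;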

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Resource Allocation
&lt;/h2&gt;

&lt;p&gt;Once selected, Slurm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assigns specific compute nodes&lt;/li&gt;
&lt;li&gt;Reserves CPUs, memory, and GPUs&lt;/li&gt;
&lt;li&gt;Changes job state to RUNNING&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now your job has allocated resources.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Node-Level Communication
&lt;/h2&gt;

&lt;p&gt;Each compute node runs a daemon called slurmd.&lt;/p&gt;

&lt;p&gt;The controller sends job details to these nodes. The nodes prepare the execution environment.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Job Execution via slurmstepd
&lt;/h2&gt;

&lt;p&gt;On the compute node, slurmstepd is launched.&lt;/p&gt;

&lt;h3&gt;
  
  
  This process:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Starts your application&lt;/li&gt;
&lt;li&gt;Manages job steps&lt;/li&gt;
&lt;li&gt;Handles output and error streams&lt;/li&gt;
&lt;li&gt;Enforces resource limits using cgroups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your script begins executing here.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Monitoring During Execution
&lt;/h2&gt;

&lt;p&gt;While the job runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slurm tracks resource usage&lt;/li&gt;
&lt;li&gt;Logs are written to output files&lt;/li&gt;
&lt;li&gt;Accounting data is collected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can monitor the job using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;squeue
scontrol show job &amp;lt;jobid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Job Completion
&lt;/h2&gt;

&lt;p&gt;When the job finishes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slurmstepd exits&lt;/li&gt;
&lt;li&gt;Resources are released&lt;/li&gt;
&lt;li&gt;Temporary processes are cleaned up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The job state becomes COMPLETED, FAILED, TIMEOUT, or CANCELLED.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 10: Accounting and Logs
&lt;/h2&gt;

&lt;p&gt;Finally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job statistics are stored&lt;/li&gt;
&lt;li&gt;Output files remain available&lt;/li&gt;
&lt;li&gt;Usage data is recorded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can check this using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sacct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
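
&lt;p&gt;If you want to process accounting data in a script, &lt;code&gt;sacct&lt;/code&gt; can print pipe-separated output, for example with &lt;code&gt;sacct -j &amp;lt;jobid&amp;gt; --format=JobID,State,ExitCode --parsable2&lt;/code&gt;. The sketch below parses text in that shape. The sample rows are invented for illustration and are not real accounting output.&lt;/p&gt;

```python
def parse_sacct(text):
    """Parse pipe-separated sacct output into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line]
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def not_completed(rows):
    """Return rows whose state is anything other than COMPLETED."""
    # Only the first word is compared, since a state can carry extra text.
    return [row for row in rows if row["State"].split()[0] != "COMPLETED"]


# Invented sample in the shape produced by --parsable2.
sample = """JobID|State|ExitCode
12345|FAILED|1:0
12345.batch|FAILED|1:0
12346|COMPLETED|0:0
"""

for row in not_completed(parse_sacct(sample)):
    print(row["JobID"], row["State"], row["ExitCode"])
```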



&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Flow Summary
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Submit job using sbatch&lt;/li&gt;
&lt;li&gt;slurmctld receives and queues it&lt;/li&gt;
&lt;li&gt;Scheduler evaluates priority&lt;/li&gt;
&lt;li&gt;Resources are allocated&lt;/li&gt;
&lt;li&gt;slurmd prepares nodes&lt;/li&gt;
&lt;li&gt;slurmstepd runs the job&lt;/li&gt;
&lt;li&gt;Job completes and resources are released&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Misconceptions
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“sbatch runs the job immediately”&lt;/em&gt;&lt;br&gt;
It only submits the job.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Pending means failure”&lt;/em&gt;&lt;br&gt;
It usually means waiting for resources.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Slurm just runs scripts”&lt;/em&gt;&lt;br&gt;
It manages scheduling, allocation, execution, and cleanup.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;sbatch may look simple, but it triggers a complete orchestration pipeline inside Slurm.&lt;/p&gt;

&lt;p&gt;Once you understand this flow, debugging becomes easier, performance tuning improves, and cluster behavior becomes predictable.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

</description>
      <category>ai</category>
      <category>slurm</category>
      <category>hpc</category>
      <category>jobscheduler</category>
    </item>
    <item>
      <title>4 Practical Boto3 Scripts for S3 Every DevOps Engineer Should Know</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sun, 26 Apr 2026 21:54:05 +0000</pubDate>
      <link>https://dev.to/zubairakbar/4-practical-boto3-scripts-for-s3-every-devops-engineer-should-know-4m27</link>
      <guid>https://dev.to/zubairakbar/4-practical-boto3-scripts-for-s3-every-devops-engineer-should-know-4m27</guid>
      <description>&lt;p&gt;Working with AWS S3 through the console is fine until you need automation, repeatability, and control. That’s where Boto3 comes in. In this post, we’ll walk through four practical Python scripts to manage S3 efficiently.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  1. List All S3 Buckets with Creation Dates
&lt;/h2&gt;

&lt;p&gt;A simple script to get visibility into your S3 environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3 Buckets:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Created On: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CreationDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this matters:
&lt;/h3&gt;

&lt;p&gt;Useful for audits, inventory tracking, or quick checks across accounts.&lt;/p&gt;
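
&lt;p&gt;For audits it often helps to sort the listing by age. The sketch below is one possible way to do that. The function name &lt;code&gt;summarize_buckets&lt;/code&gt; and the sample data are hypothetical, and it only assumes a dict shaped like the &lt;code&gt;list_buckets()&lt;/code&gt; response used above.&lt;/p&gt;

```python
from datetime import datetime, timezone


def summarize_buckets(response):
    """Return one line per bucket, oldest first.

    `response` is expected to look like the dict returned by
    s3.list_buckets(): a 'Buckets' list whose items carry
    'Name' and 'CreationDate' (a datetime).
    """
    buckets = sorted(response.get("Buckets", []), key=lambda b: b["CreationDate"])
    return [
        f"{b['Name']} | created {b['CreationDate']:%Y-%m-%d}"
        for b in buckets
    ]


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real list_buckets() response.
    sample = {
        "Buckets": [
            {"Name": "logs-archive", "CreationDate": datetime(2024, 3, 1, tzinfo=timezone.utc)},
            {"Name": "app-assets", "CreationDate": datetime(2023, 7, 15, tzinfo=timezone.utc)},
        ]
    }
    for line in summarize_buckets(sample):
        print(line)
```

&lt;p&gt;With a real client you would pass &lt;code&gt;s3.list_buckets()&lt;/code&gt; straight into the function.&lt;/p&gt;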

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Upload a File to S3 with Error Handling
&lt;/h2&gt;

&lt;p&gt;Uploading files is common, but handling failures properly is what makes scripts production-ready.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;boto3.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;S3UploadFailedError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bucket-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;object_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uploads/test.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File uploaded successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The file was not found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Credentials not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# upload_file re-raises S3 client errors as S3UploadFailedError, so catch both
&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S3UploadFailedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this matters:
&lt;/h3&gt;

&lt;p&gt;Prevents silent failures and gives clear debugging output.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Download Files from S3 with Progress Tracking
&lt;/h2&gt;

&lt;p&gt;For large files, progress tracking makes a big difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bucket-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;object_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large-file.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;downloaded.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;progress_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes_transferred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Called once per chunk with that chunk's byte count, not a running total
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transferred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bytes_transferred&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;object_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;progress_callback&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Download complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this matters:
&lt;/h3&gt;

&lt;p&gt;Gives visibility into long-running downloads, which is especially useful in automation pipelines.&lt;/p&gt;
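
&lt;p&gt;The transfer callback receives the byte count of each chunk, so showing overall progress means keeping a running total yourself. Here is a minimal sketch of that idea. The class name &lt;code&gt;ProgressTracker&lt;/code&gt; is an assumption for illustration, and the total size has to be supplied by the caller (for example from the &lt;code&gt;ContentLength&lt;/code&gt; field of a &lt;code&gt;head_object&lt;/code&gt; response).&lt;/p&gt;

```python
import threading


class ProgressTracker:
    """Accumulate per-chunk byte counts into a running total."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.seen = 0
        # Transfers may invoke the callback from several worker threads.
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self.seen += bytes_amount
            percent = 100 * self.seen / self.total_bytes if self.total_bytes else 0.0
            print(f"Transferred {self.seen} of {self.total_bytes} bytes ({percent:.1f}%)")


# Possible usage with the variables from the script above:
#   size = s3.head_object(Bucket=bucket_name, Key=object_name)["ContentLength"]
#   s3.download_file(bucket_name, object_name, file_name, Callback=ProgressTracker(size))
```

&lt;p&gt;The lock is there because large transfers can be split across threads, and each thread may report its own chunks.&lt;/p&gt;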

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Create and Delete S3 Buckets Programmatically
&lt;/h2&gt;

&lt;p&gt;Automating bucket lifecycle management is useful in testing and dynamic environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;
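&lt;span class="c1"&gt;# Pin the client to the same region as the LocationConstraint used below
&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eu-west-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;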
&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-unique-bucket-name-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# Create Bucket
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CreateBucketConfiguration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LocationConstraint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eu-west-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bucket created successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error creating bucket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Delete Bucket
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bucket deleted successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error deleting bucket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Note:
&lt;/h3&gt;

&lt;p&gt;Make sure the bucket is empty before deleting it; otherwise the delete operation fails with a BucketNotEmpty error.&lt;/p&gt;
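
&lt;p&gt;One way to empty a bucket first is to page through its objects and delete them in batches. The helper below is a rough sketch, not a complete cleanup tool. The name &lt;code&gt;empty_bucket&lt;/code&gt; is hypothetical, it expects an S3 client to be passed in, and it assumes an unversioned bucket.&lt;/p&gt;

```python
def empty_bucket(s3_client, bucket_name):
    """Submit every current object in the bucket for deletion.

    Returns the number of keys submitted. Assumes an unversioned bucket;
    versioned buckets also need their object versions and delete markers
    removed before delete_bucket will succeed.
    """
    submitted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # A page holds at most 1000 keys, which is also the delete_objects limit.
        s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
        submitted += len(keys)
    return submitted


# Possible usage before the delete step above:
#   empty_bucket(s3, bucket_name)
#   s3.delete_bucket(Bucket=bucket_name)
```

&lt;p&gt;Be careful with this in shared accounts: it removes data permanently, so it is best kept to test buckets.&lt;/p&gt;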

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;These four scripts cover the most common S3 operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visibility (listing buckets)&lt;/li&gt;
&lt;li&gt;Data movement (upload/download)&lt;/li&gt;
&lt;li&gt;Resource lifecycle (create/delete)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re simple, but extremely useful when building automation around AWS.&lt;/p&gt;

&lt;p&gt;As you scale, you can extend these with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Retry mechanisms&lt;/li&gt;
&lt;li&gt;Parallel uploads/downloads&lt;/li&gt;
&lt;/ul&gt;
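
&lt;p&gt;As an example of the retry idea, here is a small sketch of a generic wrapper with exponential backoff. The function &lt;code&gt;with_retries&lt;/code&gt; and its parameters are assumptions for illustration. Keep in mind that botocore already retries many API errors on its own, so a wrapper like this is mainly for failures outside of that, such as a flaky local step around the transfer.&lt;/p&gt;

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, then twice that, then four times that, and so on
    between tries. Re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as error:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)


# Possible usage with the upload script above:
#   with_retries(lambda: s3.upload_file(file_name, bucket_name, object_name))
```

&lt;p&gt;Passing a narrower &lt;code&gt;retry_on&lt;/code&gt; tuple avoids retrying errors that will never succeed, such as a missing local file.&lt;/p&gt;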

&lt;p&gt;This is the kind of practical automation that saves time in real environments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>boto3</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
