I recently worked on a project involving AWS Parallel Computing Service (PCS).
The main goal was to build an HPC cluster that met our specific requirements: using an image with Python 3.10, installing the necessary dependencies, and deploying a PCS cluster.
To achieve this, I built a custom AMI (Amazon Machine Image) with Packer, then launched a PCS cluster based on that image.
Throughout the process, I automated the workflow using Terraform, GitHub Actions, and shell scripts.
After deploying the cluster, I verified the setup by running several SLURM job commands, which confirmed it was operational.
In this article, I’ll walk you through the entire process, from building the custom AMI to running jobs on PCS.
What is AWS PCS?
AWS Parallel Computing Service (PCS) is a managed service that enables high-performance computing (HPC) on AWS. It’s designed for running parallel workloads such as simulations, ML training, or large-scale data analysis.
Using PCS, you can deploy and manage HPC clusters without manually configuring compute nodes, networking, or schedulers. It integrates with other AWS services such as Amazon EC2 and Amazon S3, supporting complex workloads efficiently.
Why Use AWS PCS?
Traditionally, deploying and managing HPC clusters required deep expertise in cluster configuration, job scheduling, and infrastructure management.
AWS PCS abstracts much of this complexity by offering:
Fully managed cluster orchestration
Integration with SLURM scheduler
Elastic scaling based on job demand
Infrastructure as Code (IaC) support with Terraform and CloudFormation
This makes AWS PCS an excellent choice for researchers, data scientists, and DevOps engineers who want to focus on workloads rather than infrastructure.
How Was the PCS Cluster Created?
The PCS cluster was provisioned using Terraform with the awscc provider. The infrastructure is defined as code and includes the cluster, the login and compute node groups, and the job queue.
Note: The awscc provider was required because PCS resources are not yet supported by the standard AWS provider. awscc uses the AWS Cloud Control API to manage newer services like PCS.
📢 Official Announcement — Terraform Support for AWS Parallel Computing Service (March 2025)
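For reference, a minimal provider configuration might look like this (the version constraint and region are illustrative):
# Configure the awscc provider, which talks to the AWS Cloud Control API
terraform {
  required_providers {
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.0" # illustrative; pin to a version you have tested
    }
  }
}

provider "awscc" {
  region = "eu-west-1" # replace with your target region
}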
PCS Cluster Components
A PCS setup typically includes three main components:
Cluster: Defines general settings, scheduler configuration, and networking.
Node Groups: Specifies login and compute nodes.
Queue: Manages job scheduling and execution.
PCS Setup with Terraform
Below is a simplified example of how a PCS cluster can be created using Terraform and the awscc provider:
# Create PCS cluster using the awscc provider
resource "awscc_pcs_cluster" "this" {
  count = var.enable_pcs_cluster ? 1 : 0

  name = var.cluster_name
  size = var.cluster_size

  scheduler = {
    type    = "SLURM"
    version = var.cluster_slurm_version
  }

  networking = {
    security_group_ids = [aws_security_group.this.id]
    subnet_ids         = var.subnet_ids
  }
}
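The login and compute node groups and the job queue are created with companion resources. Below is a simplified sketch of a compute node group and a queue; attribute names follow the awscc provider’s snake_case mapping of the AWS::PCS::ComputeNodeGroup and AWS::PCS::Queue schemas and may differ slightly between provider versions, and the instance type, profile ARN, and scaling limits are placeholders:
# Compute node group attached to the cluster (values are illustrative)
resource "awscc_pcs_compute_node_group" "compute" {
  # In practice this carries the same enable_pcs_cluster count guard
  cluster_id = awscc_pcs_cluster.this[0].cluster_id

  name                     = "compute"
  iam_instance_profile_arn = var.instance_profile_arn
  subnet_ids               = var.subnet_ids
  ami_id                   = var.custom_ami_id

  custom_launch_template = {
    template_id = var.launch_template_id
    version     = "$Latest"
  }

  instance_configs = [{ instance_type = "c6g.xlarge" }]

  scaling_configuration = {
    min_instance_count = 0
    max_instance_count = 4
  }
}

# Queue that schedules jobs onto the node group
resource "awscc_pcs_queue" "this" {
  cluster_id = awscc_pcs_cluster.this[0].cluster_id
  name       = "default"

  compute_node_group_configurations = [{
    compute_node_group_id = awscc_pcs_compute_node_group.compute.compute_node_group_id
  }]
}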
Variables like cluster_name, subnet_ids, and cluster_slurm_version can be parameterized to adapt the configuration across environments.
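For example, a minimal set of declarations might look like the following (types and defaults are illustrative; valid cluster sizes and SLURM versions are constrained by PCS, so check the documentation):
# Variable declarations (illustrative defaults)
variable "enable_pcs_cluster" {
  type    = bool
  default = false # keep the cluster off unless explicitly enabled
}

variable "cluster_name" {
  type = string
}

variable "cluster_size" {
  type    = string
  default = "SMALL" # PCS cluster sizes: SMALL, MEDIUM, LARGE
}

variable "subnet_ids" {
  type = list(string)
}

variable "cluster_slurm_version" {
  type = string
}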
Cost Awareness & Usage Control
Running a PCS cluster can become expensive depending on your configuration.
Each compute node in the cluster is an Amazon EC2 instance, and costs accrue while those instances are running.
To manage this, we added a Terraform variable that controls whether the PCS cluster should be deployed:
# Control cluster deployment
enable_pcs_cluster = true # create the cluster
# enable_pcs_cluster = false # skip deployment
This allows you to enable or disable the PCS cluster as needed.
For example, during development or testing, you can set enable_pcs_cluster = false to avoid unnecessary charges.
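You can also flip the flag at apply time without editing any files:
# Destroy the cluster while keeping the rest of the stack
terraform apply -var="enable_pcs_cluster=false"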
Creating Custom AMIs with Packer
Custom AMIs let you pre-install software and dependencies on compute nodes.
Steps:
Create a Packer template based on a base AMI (e.g., Ubuntu or Amazon Linux 2).
Install the required software packages.
Build the AMI and use it in your PCS cluster.
Example Packer Template:
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "{{user `aws_region`}}",
    "source_ami": "{{user `source_ami`}}",
    "instance_type": "{{user `instance_type`}}",
    "ssh_username": "{{user `ssh_username`}}",
    "ami_name": "{{user `ami_name_prefix`}}-{{user `distribution`}}-{{user `architecture`}}-{{isotime `2006.01.02-15.04`}}",
    "ami_description": "{{user `ami_description`}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo yum update -y",
      "sudo yum install -y gcc make python3",
      ...
    ]
  }]
}
This is just an example. You can add other dependencies as needed.
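With a template in place, the AMI is validated and built with the Packer CLI (the file name and variable values here are illustrative):
# Validate the template, then build the AMI
packer validate template.json
packer build -var "aws_region=eu-west-1" template.json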
We used SLURM as the job scheduler within PCS to manage HPC workloads efficiently.
It handles queueing, job submission, and node allocation, allowing multiple users or jobs to share compute resources dynamically.
During our setup, we required Python 3.10, but the default Amazon Linux 2 AMI provided only Python 3.7.
To solve this, we created a custom AMI using Packer, based on the following Ubuntu Marketplace image:
Ubuntu Server 22.04 LTS (arm64)
AMI ID: ami-0f45e2f16611e3139
Source: AWS Marketplace
This custom AMI included:
Python 3.10 preinstalled
Common HPC dependencies (gcc, make, python3-pip)
Additional Python libraries required for our workloads
By using a custom AMI, we ensured compatibility with our codebase and reduced setup time for new compute nodes.
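Since the earlier template targets yum-based distributions, the Ubuntu build swaps in apt-based provisioner commands. Ubuntu 22.04 ships Python 3.10 by default, so a minimal provisioner only needs the build tools (package names are illustrative):
# Inline shell provisioner commands for the Ubuntu 22.04 base image
sudo apt-get update
sudo apt-get install -y gcc make python3-pip
python3 --version   # should report Python 3.10.x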
Note: Custom AMIs are especially helpful when your HPC workloads require specific compiler versions, Python environments, or third-party libraries.
Testing the SLURM Cluster
Once the PCS cluster was up and running, we validated the environment by executing a few basic SLURM commands:
# Check the cluster and nodes
sinfo

# Create and submit a test job (sbatch requires a shebang line)
cat > test.sh <<'EOF'
#!/bin/bash
echo "Hello from SLURM"
EOF
sbatch test.sh

# View the job queue
squeue
These simple tests confirmed that the SLURM scheduler was active, compute nodes were responding, and jobs were successfully executed across the PCS cluster.
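By default, sbatch writes each job’s output to a slurm-<jobid>.out file in the submission directory, which makes the result easy to verify:
# Inspect the test job's output
cat slurm-*.out
# Hello from SLURM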
Conclusion
AWS PCS simplifies HPC cluster management with a managed, scalable, and elastic environment.
Terraform (awscc) enables modern Infrastructure as Code deployment.
SLURM provides flexible and familiar scheduling for HPC users.
Custom AMIs and bootstrap scripts enable deep customization and reproducibility.
Ideal for research, AI/ML training, and data-intensive simulations.
Thanks for reading!
#terraform #aws #devops #cloud #hpc #slurm #packer #ami #python