DEV Community

Justine Devasia

Building Production-Ready Nomad Clusters on AWS with Terraform

Setting up a production Nomad cluster on AWS involves significant infrastructure complexity. After implementing this setup across multiple projects, I've packaged it as a reusable Terraform codebase for teams with existing AWS and infrastructure automation experience.

Prerequisites: This requires solid experience with AWS, Terraform, and preferably some Nomad knowledge. The infrastructure is designed for teams who understand these tools but want to avoid rebuilding service discovery and cluster management from scratch.

What's Included

This infrastructure provides a complete AWS setup for running Nomad clusters:

  • Multi-AZ VPC with proper subnet design
  • Consul cluster for service discovery and configuration
  • Nomad servers with auto-scaling groups
  • Specialized client pools for different workload types
  • Application Load Balancers with SSL termination
  • Security groups following least-privilege principles
  • S3 + CloudFront for static asset delivery
  • CI/CD pipeline configurations

Why Choose Nomad?

Kubernetes excels with dedicated platform engineering teams, but Nomad offers a simpler alternative for smaller teams or when operational complexity needs to be minimized. Nomad provides straightforward container orchestration with a significantly reduced learning curve.

The job specification syntax is minimal and readable:

job "my-app" {
  datacenters = ["dc1"]

  group "web" {
    count = 3

    # Declare the "http" port label that the task references below.
    network {
      port "http" {}
    }

    task "app" {
      driver = "docker"

      config {
        image = "my-app:latest"
        ports = ["http"]
      }
    }
  }
}

This approach eliminates the complexity of services, ingresses, config maps, and other Kubernetes abstractions while maintaining production-grade orchestration capabilities.

Deployment Process

Getting this infrastructure running involves several steps:

  1. Build custom AMI using the included Packer configuration
  2. Configure remote state with the provided Terraform backend setup
  3. Deploy infrastructure after updating variables for your environment
  4. Deploy applications using Nomad job specifications

The infrastructure implements security best practices:

  • No hardcoded secrets
  • Least-privilege IAM roles
  • Private subnets for workloads
  • Configurable CIDR blocks
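
As a rough illustration of how configurable CIDR blocks and least-privilege rules can fit together, here is a minimal Terraform sketch. The variable and resource names (allowed_ingress_cidrs, vpc_id, nomad_server) are hypothetical and not taken from the repository; only port 4646, Nomad's default HTTP API port, is a known value.

```hcl
# Hypothetical names for illustration; the repository's actual variables may differ.
variable "vpc_id" {
  description = "VPC that hosts the Nomad servers"
  type        = string
}

variable "allowed_ingress_cidrs" {
  description = "CIDR blocks allowed to reach the Nomad HTTP API"
  type        = list(string)
  default     = [] # no access unless an environment opts in
}

resource "aws_security_group" "nomad_server" {
  name_prefix = "nomad-server-"
  vpc_id      = var.vpc_id
}

# Create the ingress rule only when at least one CIDR block is configured.
resource "aws_security_group_rule" "nomad_http" {
  count             = length(var.allowed_ingress_cidrs) > 0 ? 1 : 0
  type              = "ingress"
  from_port         = 4646 # Nomad's default HTTP API port
  to_port           = 4646
  protocol          = "tcp"
  cidr_blocks       = var.allowed_ingress_cidrs
  security_group_id = aws_security_group.nomad_server.id
}
```

Defaulting the list to empty means a new environment exposes nothing until its variable file explicitly grants access.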

Important: This isn't a one-click deployment. You'll need to understand the provisioning scripts, adjust networking configurations, and modify instance sizes for your specific requirements.

Specialized Node Pools

The infrastructure creates different node pools optimized for specific workloads:

| Pool Type | Purpose                   | Constraint                |
|-----------|---------------------------|---------------------------|
| Django    | Python web applications   | node.class = "django"     |
| Elixir    | Phoenix applications      | node.class = "elixir"     |
| Celery    | Background job processing | node.class = "celery"     |
| RabbitMQ  | Message queue services    | node.class = "rabbit"     |
| Datastore | Database workloads        | node.class = "datastore"  |
| APM       | Monitoring tools          | node.class = "apm"        |

Clients in each pool register with the matching node class, so a job that declares a constraint on node.class is placed only on nodes in that pool.
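
To make the placement rule concrete, here is a minimal sketch of a job pinned to the Django pool. The job name and image are placeholders, and it assumes the pool's clients set node_class = "django" in their Nomad client configuration.

```hcl
job "django-web" {
  datacenters = ["dc1"]

  # Restrict placement to nodes whose client config sets node_class = "django".
  constraint {
    attribute = "${node.class}"
    value     = "django"
  }

  group "web" {
    count = 2

    task "app" {
      driver = "docker"

      config {
        image = "my-app:latest" # placeholder image
      }
    }
  }
}
```

Placing the constraint at the job level applies it to every group in the job. It can also be declared inside a single group or task when only part of a job belongs on a given pool.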

Multi-Environment Support

The configuration supports different environments with varying security profiles:

  • Development: More permissive settings for easier testing
  • Staging: Production-like with additional debugging capabilities
  • Production: Locked-down security with comprehensive monitoring
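
One common way to express these differences is a variable file per environment. The sketch below is only an assumption about how that might look: every variable name and value in it is hypothetical and should be checked against the repository's own variables before use.

```hcl
# develop.tfvars (hypothetical variable names and values)
environment           = "develop"
allowed_ingress_cidrs = ["10.0.0.0/8"] # broader internal access for testing
nomad_server_count    = 1
enable_debug_logging  = true

# A production variable file would tighten the same settings, for example:
# environment           = "production"
# allowed_ingress_cidrs = ["10.20.0.0/24"]
# nomad_server_count    = 3
# enable_debug_logging  = false
```

Keeping the same variable names in every environment file makes the differences between environments easy to review in a diff.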

Implementation Considerations

Most infrastructure examples online fall into two categories: simplified demos that fail in production environments, or enterprise-grade solutions requiring dedicated platform teams. This infrastructure targets the middle ground: production-ready without excessive complexity.

The implementation has proven reliable across multiple projects, significantly reducing time spent on service discovery configuration and cluster bootstrapping.

However, this represents opinionated infrastructure decisions based on specific use cases. Production deployments will likely need changes to fit your own:

  • Instance types and sizing
  • Networking requirements
  • Compliance standards
  • Organizational policies

The codebase serves as a foundation for teams with the infrastructure expertise to adapt it appropriately.

Getting Started

The complete infrastructure is available as open source:

🔗 AWS Nomad Terraform Infrastructure

Quick Start

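# Each step starts in the previous step's directory; paths assume that
# packer/, tf-remote-state/, and tf-infra/ sit at the repository root.

# 1. Build custom AMI
cd packer/
packer build ami.pkr.hcl

# 2. Set up remote state
cd ../tf-remote-state/dev/
terraform init && terraform apply

# 3. Deploy infrastructure
cd ../../tf-infra/
terraform init -backend-config="backend_develop.conf"
terraform apply -var-file="develop.tfvars"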
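
The -backend-config flag in step 3 reads backend settings from a file of key/value pairs. As a sketch of what backend_develop.conf might contain, assuming the project uses Terraform's S3 backend with DynamoDB locking (an assumption, not something stated above), with every value a placeholder:

```hcl
# backend_develop.conf (assumes an S3 backend; all values are placeholders)
bucket         = "example-terraform-state-develop"
key            = "nomad/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "example-terraform-locks"
encrypt        = true
```

A partial configuration like this only takes effect when the Terraform code declares a matching backend block, such as backend "s3" {}. The bucket and lock table would need to exist before step 3, which is presumably what the remote state step creates.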

The solution is suitable for teams experienced with Kubernetes complexity who want to evaluate Nomad, or those already familiar with HashiCorp tooling. The README provides deployment instructions, though understanding the underlying Terraform modules is recommended before production use.

Final Notes

This infrastructure is provided as-is for teams comfortable with the required technology stack. The code is meant to be read, understood, and modified for your specific environment rather than used as a black box.

Bug reports and improvements via pull requests are welcome.


What has been your experience with container orchestration platforms in production environments? How do you evaluate trade-offs between operational complexity and feature completeness?
