Setting up a proper production Nomad cluster on AWS involves significant infrastructure complexity. After implementing this setup across multiple projects, I've distilled it into reusable Terraform infrastructure for teams with existing AWS and infrastructure-automation experience.
Prerequisites: solid experience with AWS and Terraform, and ideally some Nomad knowledge. The infrastructure is designed for teams who understand these tools but want to avoid rebuilding service discovery and cluster management from scratch.
What's Included
This infrastructure provides a complete AWS setup for running Nomad clusters; a rough sketch of how the pieces compose follows the list:
- Multi-AZ VPC with proper subnet design
- Consul cluster for service discovery and configuration
- Nomad servers with auto-scaling groups
- Specialized client pools for different workload types
- Application Load Balancers with SSL termination
- Security groups following least-privilege principles
- S3 + CloudFront for static asset delivery
- CI/CD pipeline configurations
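To give a feel for the layout, here's a sketch of how these pieces might compose at the top level. Module paths, names, variables, and outputs here are illustrative assumptions, not the repository's actual code:

```hcl
# Illustrative composition only -- see the repository for the real modules.
module "vpc" {
  source             = "./modules/vpc"
  cidr_block         = var.vpc_cidr
  availability_zones = var.availability_zones # multi-AZ subnet design
}

module "consul" {
  source     = "./modules/consul"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

module "nomad_servers" {
  source     = "./modules/nomad-servers"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
  consul_sg  = module.consul.security_group_id # join the Consul cluster
}
```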
Why Choose Nomad?
Kubernetes excels with dedicated platform engineering teams, but Nomad offers a simpler alternative for smaller teams or when operational complexity needs to be minimized. Nomad provides straightforward container orchestration with a significantly reduced learning curve.
The job specification syntax is minimal and readable:
```hcl
job "my-app" {
  datacenters = ["dc1"]

  group "web" {
    count = 3

    # The "http" label used below must be declared; 8080 is an assumed port.
    network {
      port "http" {
        to = 8080
      }
    }

    task "app" {
      driver = "docker"
      config {
        image = "my-app:latest"
        ports = ["http"]
      }
    }
  }
}
```
This approach eliminates the complexity of Services, Ingresses, ConfigMaps, and other Kubernetes abstractions while retaining production-grade orchestration capabilities.
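For example, where Kubernetes needs a Service (and often an Ingress) to expose and discover a workload, a Nomad group registers itself in Consul via a `service` stanza. A minimal sketch, with an illustrative service name and health-check path:

```hcl
group "web" {
  network {
    port "http" {
      to = 8080 # assumed container port
    }
  }

  service {
    name = "my-app" # registered in Consul for discovery
    port = "http"

    check {
      type     = "http"
      path     = "/health" # assumed health endpoint
      interval = "10s"
      timeout  = "2s"
    }
  }

  # task "app" { ... } as in the job above
}
```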
Deployment Process
Getting this infrastructure running involves several sequential steps (a sketch of the remote-state wiring in step 2 follows the list):
1. Build a custom AMI using the included Packer configuration
2. Configure remote state with the provided Terraform backend setup
3. Deploy the infrastructure after updating variables for your environment
4. Deploy applications using Nomad job specifications
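For step 2, Terraform's partial backend configuration keeps environment-specific state settings out of the code. The file passed via `-backend-config` in the Quick Start below might look like this; bucket, key, and table names are placeholders, not the repository's actual values:

```hcl
# backend_develop.conf -- pairs with an empty `backend "s3" {}` block
# in the Terraform configuration. All values here are illustrative.
bucket         = "my-team-terraform-state"  # created by tf-remote-state
key            = "nomad-infra/develop/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-state-lock"     # state locking
encrypt        = true
```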
The infrastructure implements security best practices (see the security-group sketch after this list):
- No hardcoded secrets
- Least-privilege IAM roles
- Private subnets for workloads
- Configurable CIDR blocks
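As an illustration of the least-privilege and configurable-CIDR points, a Nomad server's security group might take this shape. The resource and variable names are assumptions, not the repository's actual code:

```hcl
# Illustrative shape only; the repo's rules cover more ports (RPC, serf).
resource "aws_security_group" "nomad_server" {
  name_prefix = "nomad-server-"
  vpc_id      = module.vpc.vpc_id

  # Nomad HTTP API, reachable only from operator-defined CIDR blocks
  ingress {
    from_port   = 4646
    to_port     = 4646
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```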
Important: This isn't a one-click deployment. You'll need to understand the provisioning scripts, adjust networking configurations, and modify instance sizes for your specific requirements.
Specialized Node Pools
The infrastructure creates different node pools optimized for specific workloads:
| Pool Type | Purpose | Constraint |
|-----------|---------|------------|
| Django | Python web applications | `node.class = "django"` |
| Elixir | Phoenix applications | `node.class = "elixir"` |
| Celery | Background job processing | `node.class = "celery"` |
| RabbitMQ | Message queue services | `node.class = "rabbit"` |
| Datastore | Database workloads | `node.class = "datastore"` |
| APM | Monitoring tools | `node.class = "apm"` |
Nomad's constraint system automatically places jobs on appropriate nodes.
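A minimal sketch of how a job opts into a pool, assuming the client nodes set `node_class` in their Nomad configuration (the job and image names are illustrative):

```hcl
job "background-jobs" {
  datacenters = ["dc1"]

  group "workers" {
    # Pin this group to the Celery node pool
    constraint {
      attribute = "${node.class}"
      value     = "celery"
    }

    task "worker" {
      driver = "docker"
      config {
        image = "my-worker:latest" # illustrative image
      }
    }
  }
}
```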
Multi-Environment Support
The configuration supports different environments with varying security profiles (illustrated by the tfvars sketch after this list):
- Development: More permissive settings for easier testing
- Staging: Production-like with additional debugging capabilities
- Production: Locked-down security with comprehensive monitoring
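Hypothetical tfvars excerpts showing how those profiles might differ; the variable names and values are illustrative, not the repository's actual interface:

```hcl
# develop.tfvars -- permissive settings for easier testing
allowed_cidr_blocks  = ["0.0.0.0/0"] # open for convenience
nomad_server_count   = 1
client_instance_type = "t3.medium"

# production.tfvars -- locked down and sized for real traffic
# allowed_cidr_blocks  = ["10.0.0.0/8"] # internal networks only
# nomad_server_count   = 3              # raft quorum
# client_instance_type = "m5.large"
```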
Implementation Considerations
Most infrastructure examples online fall into two categories: simplified demos that fail in production environments, or enterprise-grade solutions requiring dedicated platform teams. This infrastructure aims for the middle ground: production-ready without excessive complexity.
The implementation has proven reliable across multiple projects, significantly reducing time spent on service discovery configuration and cluster bootstrapping.
However, this represents opinionated infrastructure decisions based on specific use cases. Production deployments will likely require modifications for different:
- Instance types and sizing
- Networking requirements
- Compliance standards
- Organizational policies
The codebase serves as a foundation for teams with the infrastructure expertise to adapt it appropriately.
Getting Started
The complete infrastructure is available as open source:
🔗 AWS Nomad Terraform Infrastructure
Quick Start
```bash
# Run each step from the repository root.

# 1. Build the custom AMI
cd packer/
packer build ami.pkr.hcl

# 2. Set up remote state
cd tf-remote-state/dev/
terraform init && terraform apply

# 3. Deploy the infrastructure
cd tf-infra/
terraform init -backend-config="backend_develop.conf"
terraform apply -var-file="develop.tfvars"
```
The solution suits teams who have dealt with Kubernetes' operational complexity and want to evaluate Nomad, as well as those already familiar with HashiCorp tooling. The README provides deployment instructions, though you should understand the underlying Terraform modules before using this in production.
Final Notes
This infrastructure is provided as-is for teams comfortable with the required technology stack. The code is meant to be read, understood, and modified for your specific environment rather than used as a black box.
Bug reports and improvements via pull requests are welcome.
What has been your experience with container orchestration platforms in production environments? How do you evaluate trade-offs between operational complexity and feature completeness?