Setting up a proper production Nomad cluster on AWS involves significant infrastructure complexity. After implementing this setup across multiple projects, I've distilled it into reusable Terraform infrastructure for teams with existing AWS and infrastructure-automation experience.
Prerequisites: solid experience with AWS and Terraform, and ideally some Nomad knowledge. The infrastructure is designed for teams who understand these tools but want to avoid rebuilding service discovery and cluster management from scratch.
What's Included
This infrastructure provides a complete AWS setup for running Nomad clusters; a rough sketch of how the pieces compose follows the list:
- Multi-AZ VPC with proper subnet design
- Consul cluster for service discovery and configuration
- Nomad servers with auto-scaling groups
- Specialized client pools for different workload types
- Application Load Balancers with SSL termination
- Security groups following least-privilege principles
- S3 + CloudFront for static asset delivery
- CI/CD pipeline configurations
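To give a feel for the layout, here's a sketch of how these pieces might compose at the top level. Module paths, names, variables, and outputs here are illustrative assumptions, not the repository's actual code:

```hcl
# Illustrative composition only -- see the repository for the real modules.
module "vpc" {
  source             = "./modules/vpc"
  cidr_block         = var.vpc_cidr
  availability_zones = var.availability_zones # multi-AZ subnet design
}

module "consul" {
  source     = "./modules/consul"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

module "nomad_servers" {
  source     = "./modules/nomad-servers"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
  consul_sg  = module.consul.security_group_id # join the Consul cluster
}
```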
Why Choose Nomad?
Kubernetes excels with dedicated platform engineering teams, but Nomad offers a simpler alternative for smaller teams or when operational complexity needs to be minimized. Nomad provides straightforward container orchestration with a significantly reduced learning curve.
The job specification syntax is minimal and readable:
```hcl
job "my-app" {
  datacenters = ["dc1"]

  group "web" {
    count = 3

    # The "http" label used below must be declared; 8080 is an assumed port.
    network {
      port "http" {
        to = 8080
      }
    }

    task "app" {
      driver = "docker"
      config {
        image = "my-app:latest"
        ports = ["http"]
      }
    }
  }
}
```
This approach eliminates the complexity of Services, Ingresses, ConfigMaps, and other Kubernetes abstractions while retaining production-grade orchestration capabilities.
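For example, where Kubernetes needs a Service (and often an Ingress) to expose and discover a workload, a Nomad group registers itself in Consul via a `service` stanza. A minimal sketch, with an illustrative service name and health-check path:

```hcl
group "web" {
  network {
    port "http" {
      to = 8080 # assumed container port
    }
  }

  service {
    name = "my-app" # registered in Consul for discovery
    port = "http"

    check {
      type     = "http"
      path     = "/health" # assumed health endpoint
      interval = "10s"
      timeout  = "2s"
    }
  }

  # task "app" { ... } as in the job above
}
```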
Deployment Process
Getting this infrastructure running involves several sequential steps (a sketch of the remote-state wiring in step 2 follows the list):
1. Build a custom AMI using the included Packer configuration
2. Configure remote state with the provided Terraform backend setup
3. Deploy the infrastructure after updating variables for your environment
4. Deploy applications using Nomad job specifications
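For step 2, Terraform's partial backend configuration keeps environment-specific state settings out of the code. The file passed via `-backend-config` in the Quick Start below might look like this; bucket, key, and table names are placeholders, not the repository's actual values:

```hcl
# backend_develop.conf -- pairs with an empty `backend "s3" {}` block
# in the Terraform configuration. All values here are illustrative.
bucket         = "my-team-terraform-state"  # created by tf-remote-state
key            = "nomad-infra/develop/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-state-lock"     # state locking
encrypt        = true
```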
The infrastructure implements security best practices (see the security-group sketch after this list):
- No hardcoded secrets
- Least-privilege IAM roles
- Private subnets for workloads
- Configurable CIDR blocks
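As an illustration of the least-privilege and configurable-CIDR points, a Nomad server's security group might take this shape. The resource and variable names are assumptions, not the repository's actual code:

```hcl
# Illustrative shape only; the repo's rules cover more ports (RPC, serf).
resource "aws_security_group" "nomad_server" {
  name_prefix = "nomad-server-"
  vpc_id      = module.vpc.vpc_id

  # Nomad HTTP API, reachable only from operator-defined CIDR blocks
  ingress {
    from_port   = 4646
    to_port     = 4646
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```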
Important: This isn't a one-click deployment. You'll need to understand the provisioning scripts, adjust networking configurations, and modify instance sizes for your specific requirements.
Specialized Node Pools
The infrastructure creates different node pools optimized for specific workloads:
| Pool Type | Purpose | Constraint |
|-----------|---------|------------|
| Django | Python web applications | `node.class = "django"` |
| Elixir | Phoenix applications | `node.class = "elixir"` |
| Celery | Background job processing | `node.class = "celery"` |
| RabbitMQ | Message queue services | `node.class = "rabbit"` |
| Datastore | Database workloads | `node.class = "datastore"` |
| APM | Monitoring tools | `node.class = "apm"` |
Nomad's constraint system automatically places jobs on appropriate nodes.
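A minimal sketch of how a job opts into a pool, assuming the client nodes set `node_class` in their Nomad configuration (the job and image names are illustrative):

```hcl
job "background-jobs" {
  datacenters = ["dc1"]

  group "workers" {
    # Pin this group to the Celery node pool
    constraint {
      attribute = "${node.class}"
      value     = "celery"
    }

    task "worker" {
      driver = "docker"
      config {
        image = "my-worker:latest" # illustrative image
      }
    }
  }
}
```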
Multi-Environment Support
The configuration supports different environments with varying security profiles (illustrated by the tfvars sketch after this list):
- Development: More permissive settings for easier testing
- Staging: Production-like with additional debugging capabilities
- Production: Locked-down security with comprehensive monitoring
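Hypothetical tfvars excerpts showing how those profiles might differ; the variable names and values are illustrative, not the repository's actual interface:

```hcl
# develop.tfvars -- permissive settings for easier testing
allowed_cidr_blocks  = ["0.0.0.0/0"] # open for convenience
nomad_server_count   = 1
client_instance_type = "t3.medium"

# production.tfvars -- locked down and sized for real traffic
# allowed_cidr_blocks  = ["10.0.0.0/8"] # internal networks only
# nomad_server_count   = 3              # raft quorum
# client_instance_type = "m5.large"
```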
Implementation Considerations
Most infrastructure examples online fall into two categories: simplified demos that fail in production environments, or enterprise-grade solutions requiring dedicated platform teams. This infrastructure aims for the middle ground: production-ready without excessive complexity.
The implementation has proven reliable across multiple projects, significantly reducing time spent on service discovery configuration and cluster bootstrapping.
However, this represents opinionated infrastructure decisions based on specific use cases. Production deployments will likely require modifications for different:
- Instance types and sizing
- Networking requirements
- Compliance standards
- Organizational policies
The codebase serves as a foundation for teams with the infrastructure expertise to adapt it appropriately.
Getting Started
The complete infrastructure is available as open source:
🔗 AWS Nomad Terraform Infrastructure
Quick Start
```bash
# Run each step from the repository root.

# 1. Build the custom AMI
cd packer/
packer build ami.pkr.hcl

# 2. Set up remote state
cd tf-remote-state/dev/
terraform init && terraform apply

# 3. Deploy the infrastructure
cd tf-infra/
terraform init -backend-config="backend_develop.conf"
terraform apply -var-file="develop.tfvars"
```
The solution suits teams who have dealt with Kubernetes' operational complexity and want to evaluate Nomad, as well as those already familiar with HashiCorp tooling. The README provides deployment instructions, though you should understand the underlying Terraform modules before using this in production.
Final Notes
This infrastructure is provided as-is for teams comfortable with the required technology stack. The code is meant to be read, understood, and modified for your specific environment rather than used as a black box.
Bug reports and improvements via pull requests are welcome.
What has been your experience with container orchestration platforms in production environments? How do you evaluate trade-offs between operational complexity and feature completeness?