Terraform, Autoscaling, Spot Capacity, and Workload Identity
This article focuses on implementation details. Architectural rationale and cost trade-offs are covered in Part 1.
Scope and Assumptions
This post assumes:
- Familiarity with Kubernetes fundamentals
- Comfort reading Terraform and Helm
- Interest in operating systems, not just deploying them
The platform runs on Azure Kubernetes Service, provisioned with Terraform, and deployed using Helm.
1. AKS Baseline: Start Small, Scale on Demand
The most common AKS cost mistake is provisioning for peak load.
We instead:
- Start with minimal baseline capacity
- Enable the cluster autoscaler
- Let demand drive node count
AKS Cluster with Autoscaling
```hcl
resource "azurerm_kubernetes_cluster" "aks" {
  name                = var.cluster_name
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name                 = "default"
    vm_size              = "Standard_D2s_v5"
    auto_scaling_enabled = true
    min_count            = 1
    max_count            = 10
  }

  identity {
    type = "SystemAssigned"
  }
}
```
Why this works
- Idle cost remains low
- Nodes are added only when pods are pending
- Capacity matches actual demand, not estimates
2. Horizontal Pod Autoscaling with Predictable Behavior
Autoscaling defaults are aggressive and often unstable.
We explicitly tune scale behavior to reduce churn and latency spikes.
HPA with Stabilization
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
```
Key outcomes
- Prevents rapid scale-down during brief traffic dips
- Improves tail latency
- Reduces unnecessary pod restarts
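Scale-up can be bounded with the same `behavior` mechanism. A sketch of an explicit scale-up policy (the values here are illustrative, not from our production config):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # react to load immediately
    policies:
      - type: Pods
        value: 4                    # add at most 4 pods per minute
        periodSeconds: 60
```

Capping the scale-up rate keeps a burst of pending pods from triggering a wall of new nodes at once.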
3. Spot Node Pools for Fault-Tolerant Workloads
Spot capacity is one of the highest-leverage cost optimizations—when isolated properly.
Terraform: Spot Node Pool
```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D2as_v5"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1
  auto_scaling_enabled  = true
  min_count             = 0
  max_count             = 10

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]
}
```
Scheduling Workers on Spot Nodes
```yaml
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
nodeSelector:
  kubernetes.azure.com/scalesetpriority: spot
```
Rules we followed
- APIs never run on spot
- Workers handle `SIGTERM` cleanly
- All state lives outside the worker
Used this way, spot capacity delivered substantial savings without user-visible impact.
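Putting those rules together, a spot worker Deployment might look like this (the image name and grace period are illustrative; Azure gives roughly 30 seconds of notice before a spot eviction, so the grace period should fit within it):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 30   # time to finish in-flight work on SIGTERM
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: ghcr.io/example/worker:latest   # illustrative image
```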
4. Managed PostgreSQL with Private Networking
Databases are not where cost experiments belong.
PostgreSQL runs as a managed service with:
- Subnet delegation
- Private DNS
- No public access
Delegated Subnet for PostgreSQL
```hcl
resource "azurerm_subnet" "postgres" {
  name                 = "postgres-subnet"
  virtual_network_name = azurerm_virtual_network.main.name
  resource_group_name  = var.resource_group_name
  address_prefixes     = ["10.0.2.0/24"]

  delegation {
    name = "postgres"

    service_delegation {
      name = "Microsoft.DBforPostgreSQL/flexibleServers"
    }
  }
}
```
Why this matters
- Database is unreachable from the public internet
- Access is restricted at the network layer
- Operational risk is significantly reduced
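For context, the server itself references the delegated subnet and a private DNS zone. A minimal sketch (the server name, SKU, and variable names are placeholders, and it assumes a `azurerm_private_dns_zone` resource exists elsewhere in the configuration):

```hcl
resource "azurerm_postgresql_flexible_server" "db" {
  name                = "example-postgres"   # illustrative; must be unique
  resource_group_name = var.resource_group_name
  location            = var.location
  version             = "16"

  delegated_subnet_id = azurerm_subnet.postgres.id
  private_dns_zone_id = azurerm_private_dns_zone.postgres.id

  sku_name = "B_Standard_B1ms"   # smallest burstable tier; size to your workload

  administrator_login    = var.db_admin_login
  administrator_password = var.db_admin_password
}
```

With `delegated_subnet_id` set, the server is only reachable from inside the virtual network.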
5. Secure Deployments with Helm Defaults
Helm charts were written with secure-by-default assumptions.
Pod Security Context
```yaml
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
```
This immediately:
- Shrinks the attack surface
- Prevents runtime mutation
- Surfaces insecure images early
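One practical consequence of `readOnlyRootFilesystem`: any app that writes temporary files needs an explicit writable mount. A common pattern (the mount path is illustrative):

```yaml
containers:
  - name: api
    volumeMounts:
      - name: tmp
        mountPath: /tmp   # only writable path in the container
volumes:
  - name: tmp
    emptyDir: {}
```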
Health Probes That Matter
```yaml
livenessProbe:
  httpGet:
    path: /health/db-cache
    port: 8080
  initialDelaySeconds: 60
```
We intentionally check dependencies, not just process health.
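A readiness probe complements this: liveness failures restart the pod, while readiness failures just pull it out of Service rotation. A sketch using the same hypothetical endpoint:

```yaml
readinessProbe:
  httpGet:
    path: /health/db-cache
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # ~30s of failures before removal from rotation
```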
6. Workload Identity: No Secrets in Kubernetes
Storing cloud credentials in Kubernetes secrets is unnecessary.
We use Workload Identity for pod-to-Azure authentication.
Federated Identity Credential
```hcl
resource "azurerm_federated_identity_credential" "api" {
  name                = "api-federated"
  resource_group_name = var.resource_group_name
  parent_id           = azurerm_user_assigned_identity.api.id
  issuer              = azurerm_kubernetes_cluster.aks.oidc_issuer_url
  subject             = "system:serviceaccount:production:api-sa"
  audience            = ["api://AzureADTokenExchange"]
}
```
Kubernetes Service Account
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  annotations:
    azure.workload.identity/client-id: "<client-id>"
```
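For the token exchange to happen, pods must also opt in via a label so the Workload Identity webhook injects the projected token. A minimal pod-template sketch:

```yaml
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: api-sa
```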
Result
- No long-lived secrets in manifests
- Automatic token rotation
- Azure RBAC enforced at runtime
7. Observability Without Ingestion-Based Pricing
Instead of managed log ingestion, we use:
- Prometheus for metrics
- Loki for logs
- Object storage for retention
Loki Storage Configuration
```yaml
storage_config:
  azure:
    container_name: logs
    account_name: ${ACCOUNT_NAME}
    access_tier: Cool
```
Why this works
- Logs are queried infrequently
- Storage is inexpensive
- Ingestion costs dominate managed observability pricing
This preserved full visibility with minimal incremental cost.
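On the Terraform side, the backing storage can default new blobs to the Cool tier. A sketch (the account name is a placeholder and must be globally unique):

```hcl
resource "azurerm_storage_account" "logs" {
  name                     = "examplelogsacct"   # illustrative
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  access_tier              = "Cool"   # default tier for new blobs
}

resource "azurerm_storage_container" "logs" {
  name                  = "logs"
  storage_account_name  = azurerm_storage_account.logs.name
  container_access_type = "private"
}
```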
8. Infrastructure Access Patterns (Secure and Practical)
Cluster Access
- Azure AD–backed `kubectl` access
- Auditable
- No shared credentials
In practice, cluster-admin access is restricted to a small bootstrap group; most teams use namespace-scoped roles.
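On the cluster resource, this corresponds to enabling Azure AD-backed RBAC. A sketch of the relevant block (recent azurerm 4.x syntax; the variable name is hypothetical):

```hcl
resource "azurerm_kubernetes_cluster" "aks" {
  # ...existing configuration...

  azure_active_directory_role_based_access_control {
    azure_rbac_enabled     = true
    admin_group_object_ids = [var.bootstrap_admin_group_id]   # small bootstrap group
  }
}
```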
Database Access (Occasional Admin Tasks)
```bash
kubectl run psql \
  --image=postgres:16 \
  --rm -it -- psql -h <private-host> -U admin
```
The database remains private; access is authenticated and auditable.
Observability Dashboards
Grafana is protected behind OAuth using Azure AD.
OAuth2-Proxy runs with multiple replicas and rotated cookie secrets to avoid becoming a single point of failure.
No VPNs. No bastion hosts. No additional managed services required.
Operational Lessons (The Real Ones)
Defaults Are Rarely Production-Safe
Autoscaling, probes, and security contexts need explicit tuning.
Spot Capacity Requires Discipline
It works extremely well—but only when isolated.
Identity Scales Better Than Secrets
It reduces operational load and security risk.
Kubernetes Is a Tool, Not a Destination
Use it where it adds leverage—not as a dumping ground.
Closing Thoughts
This implementation isn’t about clever tricks—it’s about intentional trade-offs.
By:
- Scaling only when needed
- Using spot capacity responsibly
- Keeping critical state managed
- Avoiding ingestion-based observability costs
- Treating security as a default
we ended up with a system that is:
- Predictable to operate
- Cost-efficient at idle
- Resilient under load
- Easy to evolve
Architecture sets direction.
Implementation determines whether it survives contact with reality.