Modern infrastructure moves fast. Configs change, components get upgraded, and small tweaks can have big ripple effects. Are you sure everything still works after each change?
That’s where infrastructure auto tests shine: they validate real behavior in a real cluster and act as living specs for your platform.
In this post, we’ll walk through a practical example: a test that validates Kubernetes Cluster Autoscaler by forcing a scale-up and checking end-to-end results.
What we test: Cluster Autoscaler
This is a simple example to show the approach. The goal: prove that writing infra tests is straightforward and genuinely useful.
Cluster Autoscaler watches pending pods and scales the right node group when capacity is tight.
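Concretely, the trigger the autoscaler reacts to is unschedulable pods. A quick way to see that signal yourself is to list Pending pods; below is a small sketch that shells out to kubectl from Python (the context name is whatever your kubeconfig uses):
# sketch: list Pending pods; the autoscaler reacts to the unschedulable subset of these
import subprocess

def pending_pods(context: str) -> list[str]:
    out = subprocess.run(
        ["kubectl", "--context", context, "get", "pods",
         "--all-namespaces", "--field-selector=status.phase=Pending",
         "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()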
Why test it? What can go wrong?
- A bad autoscaler release.
- A breaking change in the Helm chart.
- Misconfiguration.
- Incompatible cluster components.
- Access, quota, or IAM issues that block scale-up.
- Networking issues (for external autoscalers).
Our test covers the full path end-to-end:
Pending pods ──> Cluster Autoscaler ──> Node group scales ──> Pods Ready
- Detect the target node group for the current cluster context.
- Create a deployment with replicas set to current_nodes + 1.
- Use pod anti-affinity, so pods spread across nodes, guaranteeing the need for an extra node.
- Use node affinity/selector, so pods land only on the intended node group.
- Verify we see the expected pending pod initially; then wait for all pods to be Ready within a timeout.
- Assert success when all pods are scheduled and ready.
How the test is designed
I wrote the test in Python because it’s easy to read, most engineers already know it, and it’s usually available by default. No extra Python deps needed.
If you prefer Bash, check out the bats framework.
- Node group detection: EKS clusters use the label eks.amazonaws.com/nodegroup, and we expect the node group main (see the sketch after this list).
- Workload shape: a tiny deployment using registry.k8s.io/pause:3.9 with small CPU/memory requests. We add strict scheduling constraints:
  - Pod anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) to force one pod per node.
  - Node selector/affinity so we only use the target node group.
- Assertions:
  - Immediately after creation, we expect exactly one pod to be Pending.
  - Within the timeout (5 minutes by default), all pods must become Ready.
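For illustration, here is a minimal sketch of how node-group detection can work. The label key is the standard one on EKS managed node groups; the group name main, the function name, and its shape are assumptions for this example rather than the repo's actual _get_target_nodes():
# sketch: find nodes belonging to the target node group by label
import json
import subprocess

NODEGROUP_LABEL = "eks.amazonaws.com/nodegroup"  # EKS managed node group label
TARGET_NODEGROUP = "main"                        # assumed target group name

def get_target_nodes(context: str) -> list[str]:
    out = subprocess.run(
        ["kubectl", "--context", context, "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [
        node["metadata"]["name"]
        for node in json.loads(out)["items"]
        if node["metadata"].get("labels", {}).get(NODEGROUP_LABEL) == TARGET_NODEGROUP
    ]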
Running the test
I use mise to run the test. It's a handy tool for managing dependencies and tasks, so teammates don't need to learn a custom Python CLI.
Run mise tasks to see what's available. For the autoscaler test: mise test:cluster-autoscaler.
It's easy to drop into CI: add kubectl and python to .mise.toml and run mise install up front. More details are in my post about mise.
Code examples
Below are minimal snippets from the real test suite. Full code is in the repo.
# cluster_autoscaler/test.py
import json
import time

from utils import BaseInfraTest
# other imports
class ClusterAutoscaler(BaseInfraTest):
RESOURCE_NAME = "infra-tests-cluster-autoscaler"
TIMEOUT_SECONDS = 300 # 5 minutes
DEPLOYMENT_CHECK_DELAY_SECONDS = 5
def setUp(self):
# prepare common things that are used in all tests
super().setUp()
# set internal variables
# ...
def _get_target_nodes(self):
# Get all nodes and filter for target nodegroup
# ...
def test_cluster_autoscaler(self):
# Get current nodes in target node group
target_nodes, nodegroup_label_value = self._get_target_nodes()
current_node_count = len(target_nodes)
# Calculate required replicas (current nodes + extra to trigger scaling)
required_replicas = current_node_count + 1
# Read and prepare deployment manifest
with open(self.deployment_path, encoding="utf-8") as f:
manifest = f.read()
# Apply deployment manifest
self.command(
f"kubectl --context {self.current_context} apply -f -",
input=manifest
% (
self.RESOURCE_NAME, # deployment name
self.namespace, # namespace
self.RESOURCE_NAME, # app label
required_replicas, # replicas count
self.RESOURCE_NAME, # selector matchLabels app
self.RESOURCE_NAME, # pod template app label
json.dumps(
{self.nodegroup_label_key: nodegroup_label_value}
), # nodeSelector
self.RESOURCE_NAME, # pod anti-affinity
),
)
# Give k8s a moment to process the deployment
time.sleep(self.DEPLOYMENT_CHECK_DELAY_SECONDS)
# Check that there is 1 pod in pending state
pods_cmd = (
f"kubectl --context {self.current_context}"
f" --namespace {self.namespace}"
f" get pods"
f" -l app={self.RESOURCE_NAME}"
f" -o jsonpath='{{.items[*].status.phase}}'"
)
pod_phases = self.command(pods_cmd).split()
self.assertEqual(
pod_phases.count("Pending"),
1,
f"Expected 1 pod in pending state, got {pod_phases.count('Pending')}",
)
# Wait for all pods to be ready using kubectl wait
wait_cmd = (
f"kubectl --context {self.current_context}"
f" --namespace {self.namespace}"
f" wait pod"
f" -l app={self.RESOURCE_NAME}"
f" --for=condition=Ready"
f" --timeout={self.TIMEOUT_SECONDS}s"
)
self.command(wait_cmd)
Mise configuration:
# .mise.toml
[tools]
python = "3.13.0"
kubectl = "1.31.1"
[tasks.test]
description = "Run all infrastructure tests"
run = "python -m unittest discover -v"
[tasks."test:cluster-autoscaler"]
description = "Run tests for cluster-autoscaler"
run = """
python -m unittest -v cluster_autoscaler.test.ClusterAutoscaler
"""
deployment.yaml is a regular Kubernetes Deployment with %s placeholders that get templated at runtime. A full example is here.
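To make the placeholder mapping concrete, here is a sketch of what such a template could look like, written as a Python string so the placeholders line up with the comments in the apply call above. The resource requests and the topologyKey are assumptions, and the real manifest in the repo may differ:
# sketch of a deployment template; placeholder order matches the apply call in test.py
DEPLOYMENT_TEMPLATE = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: %s                # deployment name
  namespace: %s           # namespace
  labels:
    app: %s               # app label
spec:
  replicas: %s            # replicas count
  selector:
    matchLabels:
      app: %s             # selector matchLabels app
  template:
    metadata:
      labels:
        app: %s           # pod template app label
    spec:
      nodeSelector: %s    # JSON object produced by json.dumps(...)
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: %s # pod anti-affinity
              topologyKey: kubernetes.io/hostname  # assumed: one pod per node
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m      # assumed small request
              memory: 16Mi  # assumed small request
"""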
Principles of solid infra tests
- Simple: easy to maintain and easy to add new tests.
- Small: extract boilerplate into shared utilities.
- Fast: a slow suite won't be run often, so keep it quick.
- Wait on conditions: wait for states, not fixed sleeps.
Well-structured tests make it easy to add new ones, even with help from LLMs: write the test case, then ask the model to implement it using the existing tests as a reference.
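To make the "wait on conditions" principle concrete, here is a tiny polling helper of the kind that could live in the shared utilities. The name, defaults, and the all_pods_ready check in the usage line are illustrative, not taken from the repo:
# sketch: poll a condition instead of sleeping for a fixed amount of time
import time

def wait_for(condition, timeout_seconds=300, interval_seconds=5, message="condition"):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval_seconds)
    raise TimeoutError(f"Timed out after {timeout_seconds}s waiting for {message}")

# usage (all_pods_ready is a hypothetical check):
# wait_for(all_pods_ready, timeout_seconds=300, message="pods to become Ready")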
Closing thoughts
Infrastructure tests are your early-warning system for platform regressions. By validating real behavior in real clusters, you turn assumptions into executable specs.
The Cluster Autoscaler test is a concise, low-risk example that catches issues you might otherwise discover as a late-night surprise.
Apply the same approach to other components to keep changes safe and verified.