Roman Geraskin

Posted on • Originally published at blog.rgeraskin.dev

Infrastructure testing in practice

Modern infrastructure moves fast. Configs change, components get upgraded, and small tweaks can have big ripple effects. Are you sure everything still works after each change?

That’s where automated infrastructure tests shine: they validate real behavior in a real cluster and act as living specs for your platform.

In this post, we’ll walk through a practical example: a test that validates Kubernetes Cluster Autoscaler by forcing a scale-up and checking end-to-end results.

What we test: Cluster Autoscaler

This is a simple example to show the approach. The goal: prove that writing infra tests is straightforward and genuinely useful.

Cluster Autoscaler watches pending pods and scales the right node group when capacity is tight.

Why test it? What can go wrong?

  • A bad autoscaler release.

  • Helm chart breaking change.

  • Misconfiguration.

  • Incompatible cluster components.

  • Access/quotas/IAM issues that block scale-up.

  • Networking issues (for external autoscalers).

Our test covers the full path end-to-end:

Pending pods ──> Cluster Autoscaler ──> Node group scales ──> Pods Ready

  1. Detect the target node group for the current cluster context.

  2. Create a deployment with replicas set to current_nodes + 1.

  3. Use pod anti-affinity so pods spread across nodes, guaranteeing that an extra node is needed.

  4. Use node affinity/selector so pods land only on the intended node group.

  5. Verify we see the expected pending pod initially; then wait for all pods to be Ready within a timeout.

  6. Assert success when all pods are scheduled and ready.

How the test is designed

I wrote the test in Python because it’s easy to read, most engineers already know it, and it’s usually available by default. No extra Python deps needed.

If you prefer Bash, check out the bats framework.

  • Node group detection: EKS clusters label nodes with eks.amazonaws.com/nodegroup; the test expects the target group to be named main (see the sketch after this list).

  • Workload shape: A tiny deployment using registry.k8s.io/pause:3.9 with small CPU/memory requests. We add strict scheduling constraints:

    • Pod anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) to force one pod per node.
    • Node selector/affinity so we only use the target node group.
  • Assertions:

    • Immediately after creation, we expect exactly one pod to be Pending.
    • Within the timeout (5 minutes by default), all pods must become Ready.
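
For illustration, here’s a minimal sketch of how node group detection can work, assuming kubectl access and the EKS label above. get_target_nodes is a hypothetical standalone version; the real helper lives in the test suite:

# sketch: node group detection (hypothetical standalone helper)
import json
import subprocess

NODEGROUP_LABEL_KEY = "eks.amazonaws.com/nodegroup"
NODEGROUP_LABEL_VALUE = "main"  # the expected target node group

def get_target_nodes(context):
    """Return the names of all nodes in the target node group."""
    result = subprocess.run(
        [
            "kubectl", "--context", context, "get", "nodes",
            "-l", f"{NODEGROUP_LABEL_KEY}={NODEGROUP_LABEL_VALUE}",
            "-o", "json",
        ],
        capture_output=True, text=True, check=True,
    )
    nodes = json.loads(result.stdout)["items"]
    return [node["metadata"]["name"] for node in nodes]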

Running the test

I use mise to run the test. It’s a handy tool to manage dependencies and tasks, so teammates don’t need to learn a custom Python CLI.

Run mise tasks to see what’s available. For the autoscaler test: mise test:cluster-autoscaler.

It’s easy to drop into CI: add kubectl and python to .mise.toml and run mise install up front. More details in my post about mise.
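
In CI that boils down to two commands, assuming the .mise.toml shown below is committed to the repo:

# sketch: CI steps
mise install                      # installs the pinned python and kubectl
mise run test:cluster-autoscaler  # runs the autoscaler test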

Code examples

Below are minimal snippets from the real test suite. Full code is in the repo.

# cluster_autoscaler/test.py

import json
import time

from utils import BaseInfraTest
# other imports

class ClusterAutoscaler(BaseInfraTest):
    RESOURCE_NAME = "infra-tests-cluster-autoscaler"
    TIMEOUT_SECONDS = 300  # 5 minutes
    DEPLOYMENT_CHECK_DELAY_SECONDS = 5

    def setUp(self):
        # prepare common things that are used in all tests
        super().setUp()

        # set internal variables
        # ...

    def _get_target_nodes(self):
        # Get all nodes and filter for target nodegroup
        # ...

    def test_cluster_autoscaler(self):
        # Get current nodes in target node group
        target_nodes, nodegroup_label_value = self._get_target_nodes()
        current_node_count = len(target_nodes)

        # Calculate required replicas (current nodes + extra to trigger scaling)
        required_replicas = current_node_count + 1

        # Read and prepare deployment manifest
        with open(self.deployment_path, encoding="utf-8") as f:
            manifest = f.read()

        # Apply deployment manifest
        self.command(
            f"kubectl --context {self.current_context} apply -f -",
            input=manifest
            % (
                self.RESOURCE_NAME,  # deployment name
                self.namespace,  # namespace
                self.RESOURCE_NAME,  # app label
                required_replicas,  # replicas count
                self.RESOURCE_NAME,  # selector matchLabels app
                self.RESOURCE_NAME,  # pod template app label
                json.dumps(
                    {self.nodegroup_label_key: nodegroup_label_value}
                ),  # nodeSelector
                self.RESOURCE_NAME,  # pod anti-affinity
            ),
        )

        # Give k8s a moment to process the deployment
        time.sleep(self.DEPLOYMENT_CHECK_DELAY_SECONDS)

        # Check that there is 1 pod in pending state
        pods_cmd = (
            f"kubectl --context {self.current_context}"
            f" --namespace {self.namespace}"
            f" get pods"
            f" -l app={self.RESOURCE_NAME}"
            f" -o jsonpath='{{.items[*].status.phase}}'"
        )
        pod_phases = self.command(pods_cmd).split()
        self.assertEqual(
            pod_phases.count("Pending"),
            1,
            f"Expected 1 pod in pending state, got {pod_phases.count('Pending')}",
        )

        # Wait for all pods to be ready using kubectl wait
        wait_cmd = (
            f"kubectl --context {self.current_context}"
            f" --namespace {self.namespace}"
            f" wait pod"
            f" -l app={self.RESOURCE_NAME}"
            f" --for=condition=Ready"
            f" --timeout={self.TIMEOUT_SECONDS}s"
        )

        self.command(wait_cmd)

Mise configuration:

# .mise.toml
[tools]
python = "3.13.0"
kubectl = "1.31.1"

[tasks.test]
description = "Run all infrastructure tests"
run = "python -m unittest discover -v"

[tasks."test:cluster-autoscaler"]
description = "Run tests for cluster-autoscaler"
run = """
python -m unittest -v cluster_autoscaler.test.ClusterAutoscaler
"""

deployment.yaml is a regular Kubernetes Deployment with %s placeholders that get templated at runtime. A full example is here.
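
To make the placeholder order concrete, here’s a simplified sketch of what such a manifest can look like. The structure follows the tuple order in the Python snippet above; the container name, resource requests, and topologyKey are illustrative values, not copied from the repo:

# deployment.yaml (simplified sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: %s        # deployment name
  namespace: %s   # namespace
  labels:
    app: %s       # app label
spec:
  replicas: %s    # replicas count
  selector:
    matchLabels:
      app: %s     # selector matchLabels app
  template:
    metadata:
      labels:
        app: %s   # pod template app label
    spec:
      nodeSelector: %s  # nodeSelector, rendered as JSON
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: %s  # pod anti-affinity app label
              topologyKey: kubernetes.io/hostname
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m
              memory: 16Mi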

Principles of solid infra tests

  • Simple: easy to maintain and easy to extend with new tests.

  • Small: extract boilerplate into shared utilities.

  • Fast: slow suites stop getting run.

  • Wait on conditions: poll for the state you care about instead of sleeping for a fixed time (see the sketch after this list).
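
As a minimal illustration of the last point, here’s a generic condition-waiting helper (hypothetical; the autoscaler test above delegates this to kubectl wait instead):

# sketch: wait for a condition instead of sleeping (hypothetical helper)
import time

def wait_for(condition, timeout_seconds=300, poll_seconds=5):
    """Poll `condition` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"condition not met within {timeout_seconds}s")

When a kubectl condition exists, kubectl wait --for=condition=Ready is usually the simpler choice; a helper like this is for states kubectl can’t express directly.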

Well-structured tests make it easy to add new ones, even with help from LLMs: write the test case description, then ask the model to implement it using the existing tests as a reference.

Closing thoughts

Infrastructure tests are your early-warning system for platform regressions. By validating real behavior in real clusters, you turn assumptions into executable specs.

The Cluster Autoscaler test is a concise, low-risk example that catches issues you might otherwise discover as a late-night surprise.

Apply the same approach to other components to keep changes safe and verified.
