
Declarative Chaos: Building Failure Experiments via Infrastructure-as-Code

Failure is inevitable in distributed systems. But it doesn't have to be unpredictable.

Chaos engineering—intentionally injecting failures to observe system behavior—has become a standard practice for resilience testing. Yet for many teams, it's still performed as a manual or ad hoc process, often siloed from broader platform operations.

What if chaos experiments could be codified, version-controlled, peer-reviewed, and orchestrated just like the rest of your infrastructure?

That’s the promise of declarative chaos engineering—an approach where failure experiments are written, managed, and executed as part of your infrastructure-as-code (IaC) workflows. When integrated with platform engineering principles, it offers a safe, auditable, and automated path to resilience.

From ClickOps to GitOps to ChaosOps

Modern platform teams already manage their infrastructure using declarative tools like Terraform, Pulumi, or Helm. These tools provide consistency, collaboration, and control through code.

By extending the same practices to chaos engineering, teams can:

  • Define failure scenarios as declarative code

  • Store them in version control alongside app/service configs

  • Review them like any other pull request

  • Trigger them through CI/CD or scheduled jobs

  • Roll them back with Git if needed

This approach brings chaos engineering into the realm of GitOps and platform-as-code, making it both accessible and operationally mature. A minimal CI trigger for such experiments is sketched below.
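For instance, the "trigger through CI/CD or scheduled jobs" step could be a workflow that applies the experiment manifests kept in the repository. Here is a minimal sketch using GitHub Actions; the file path, secret name, and schedule are assumptions for illustration, not part of any specific setup:

```yaml
# .github/workflows/chaos.yml -- hypothetical workflow; the path, secret
# name, and schedule are illustrative.
name: weekly-chaos
on:
  schedule:
    - cron: "0 6 * * 1"        # every Monday at 06:00 UTC
  workflow_dispatch: {}         # allow manual, on-demand runs

jobs:
  run-chaos-experiments:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes a kubeconfig for the target cluster is stored as a repo secret
      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.STAGING_KUBECONFIG }}" > ~/.kube/config
      - name: Apply chaos experiment manifests
        run: kubectl apply -f chaos/
```

Because the workflow itself lives in Git, the schedule and scope of every experiment go through the same review process as any other change.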

Defining Chaos as Code: Examples

Let’s say you want to test how your Kubernetes service behaves under CPU exhaustion. A declarative chaos experiment, written as a Chaos Mesh custom resource, could look like this:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    namespaces:
      - improwised-payment
  stressors:
    cpu:
      workers: 4
  duration: "60s"
```

Or, using Terraform with a Chaos Toolkit-style plugin (the resource type below is illustrative), you might codify:

resource "chaos_experiment" "network_latency" {
  target_service = "improwised-checkout-api"
  fault_type     = "latency"
  delay_ms       = 300
  duration       = 120
}


This shift enables chaos engineering to live alongside deployment manifests, observability dashboards, and policy definitions—ensuring cohesion across the platform.

Benefits of Declarative Chaos in Platform Engineering

By adopting chaos-as-code within a platform engineering framework, teams gain:

  • Reusability: Standard fault templates can be applied across environments.

  • Auditability: All chaos actions are logged, reviewed, and traceable.

  • Repeatability: Run identical experiments in dev, staging, or prod.

  • Safe experimentation: Guardrails via RBAC, scopes, and timeouts (a sample RBAC guardrail is sketched after this list).

  • Automation: Trigger chaos tests automatically via CI/CD, Git events, or scheduled jobs.
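As one example of such guardrails, the permission to create chaos resources can itself be scoped with Kubernetes RBAC so that only a dedicated runner in a dedicated namespace can launch experiments. A minimal sketch, with assumed names:

```yaml
# Hypothetical Role: chaos resources may only be managed in the "staging"
# namespace; bind it to the service account your CI runner uses.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-runner
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["stresschaos", "networkchaos", "podchaos"]
    verbs: ["create", "get", "list", "watch", "delete"]
```

Combined with the duration field on each experiment, this keeps the blast radius, the audience, and the time window explicitly bounded.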

This approach naturally complements code and infrastructure management practices that already exist in many platform engineering teams—making chaos part of the everyday pipeline, not a risky one-off event.

Practical Considerations

Implementing declarative chaos effectively requires:

  • Version-controlled configuration
    Store chaos files in the same repositories as the services they affect.

  • Controlled environments
    Start with sandboxed clusters or staging environments before moving to production scenarios.

  • Observability integration
    Ensure tools like Prometheus, Grafana, and OpenTelemetry are in place to track metrics during tests.

  • Approval workflows
    Use PR reviews, CI policies, or GitHub Actions to gate experiment execution.

  • Scope isolation
    Define the namespace, time window, and target pods to prevent unintended spread.
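Putting that last point into practice, a tightly scoped experiment pins the namespace, a label selector, a bounded share of pods, and a fixed duration. A sketch with illustrative names:

```yaml
# Illustrative scoping: affect at most 20% of the matching pods in one
# namespace, for two minutes, and nothing else.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: scoped-memory-stress
  namespace: staging
spec:
  mode: fixed-percent
  value: "20"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: checkout             # hypothetical label
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "120s"
```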

A Real-World Use Case

Consider a team running a microservices platform on Kubernetes. They want to test if their order-processing service can handle intermittent network issues with downstream APIs.

Instead of manually injecting latency or setting up complex chaos suites, they define a simple YAML-based fault scenario using Chaos Mesh. It’s stored in Git, triggered by a CI job every week, and monitored with pre-defined Grafana dashboards.
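That fault scenario might look something like the following NetworkChaos resource; the namespace, labels, and downstream target are illustrative rather than taken from a real system:

```yaml
# Illustrative Chaos Mesh fault: add latency on traffic from the
# order-processing pods to a downstream API for five minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: downstream-latency
  namespace: orders
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - orders
    labelSelectors:
      app: order-processing     # hypothetical label
  direction: to
  externalTargets:
    - "payments.example.com"    # hypothetical downstream API
  delay:
    latency: "300ms"
    jitter: "50ms"
  duration: "5m"
```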

Over time, these tests reveal missing retry logic and a lack of circuit breakers. After addressing these issues, the system not only becomes more resilient—but the tests themselves become a living regression suite for reliability.

Final Thoughts

Chaos engineering doesn’t have to be disruptive. With a declarative, platform-centric approach, it becomes just another layer of infrastructure testing—codified, automated, and safe.

By integrating fault injection directly into infrastructure workflows, teams can normalize failure testing the same way they normalized unit tests or linting. Declarative chaos turns “what if” into “we already know”—and that’s a superpower every platform should have.
