Infrastructure as Code Best Practices: Terraform, Pulumi, and OpenTofu in 2026

The Infrastructure as Code landscape changed permanently in August 2023 when HashiCorp moved Terraform from the Mozilla Public License to the Business Source License 1.1. The change restricted commercial use for products that compete with HashiCorp's own offerings. Within weeks, the OpenTofu fork was announced under the Linux Foundation. It reached general availability in January 2024.

Three years later, engineering teams have settled into a stable but fractured landscape. Terraform remains dominant by installed base. OpenTofu is the default choice for teams that need an open-source, community-governed alternative. Pulumi occupies a different position: it replaces HCL entirely with general-purpose languages, which makes it compelling for teams that want to apply software engineering practices directly to infrastructure.

The tool you choose matters less than how you use it. State management mistakes, oversized modules, and missing tests cause production incidents regardless of which binary you run. This piece covers what actually works in production across all three.

The IaC Landscape in 2026

Dimension	Terraform	OpenTofu	Pulumi
License	BSL 1.1 (commercial restrictions)	MPL 2.0 (fully open-source)	Apache 2.0
Language	HCL	HCL (compatible)	TypeScript, Python, Go, C#, Java, YAML
State backend	Local, S3, GCS, HCP Terraform (formerly Terraform Cloud)	Local, S3, GCS, any Terraform backend	Local, S3, GCS, Pulumi Cloud
Unit test support	Native terraform test (assertions in HCL); Terratest for Go-based tests	Native tofu test (assertions in HCL); Terratest for Go-based tests	Native (Jest, pytest, standard frameworks)
Provider ecosystem	Largest (1,800+ providers)	Near-identical to Terraform	Growing (1,200+ providers)
Hosted backend pricing	HCP Terraform (per resource)	Self-hosted or third-party	Pulumi Cloud (per user)
Notable adopters	Most enterprises pre-2024	Gruntwork, env0, Spacelift	Mercedes-Benz, Snowflake, Lemonade

OpenTofu's 2024 releases added features Terraform had not shipped: client-side state encryption in 1.7 and early evaluation of variables and locals in 1.8. (Provider-defined functions, by contrast, arrived in both tools that spring.) The two tools are now diverging at the feature level, not just the governance level. For most teams, migrating from Terraform to OpenTofu is still a binary swap followed by tofu init and a plan that should show no changes, but that window is narrowing as feature sets diverge.

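Pulumi's advantage is not language choice. It is testability. You can write unit tests for Pulumi programs using standard test frameworks, with cloud calls mocked, without provisioning real infrastructure. The closest HCL equivalent is the native test command (terraform test or tofu test), which can assert on a plan without applying it, but the tests are written in HCL rather than in a general-purpose test framework.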

State Management: The Foundation Everything Else Rests On

State files are the most dangerous files in your infrastructure. They contain resource IDs, outputs, and often plaintext secrets. A corrupted or lost state file means you have running infrastructure with no record of how it was created. Manual console changes, race conditions from concurrent applies, and state file leaks are responsible for roughly 23% of production infrastructure incidents in teams without proper state management.

The three non-negotiable rules for state management:

Remote state with locking. Never use local state in shared environments. On AWS, configure an S3 backend with a DynamoDB lock table or, on recent Terraform and OpenTofu releases, the S3 backend's native lock file. On GCP, use a GCS backend, which locks state natively. The lock prevents two engineers from running apply simultaneously against the same state file, which is a common cause of state corruption.

Directory-based environment separation. Terraform workspaces look appealing: one set of code, switch environments with terraform workspace select staging. In practice, workspaces share the same backend configuration and the same variable files unless you add significant scaffolding. A moment of inattention about which workspace is selected applies a staging change to production. Directory-based separation, where environments/dev, environments/staging, and environments/prod each have their own state file and their own variable file, is explicit and auditable.

Approach	Blast radius	State isolation	Auditability	Failure mode
Workspace-based	A mistake in shared code or backend config can reach every workspace	Shared backend, separate keys	Hard to trace which workspace was active	Applying to wrong workspace
Directory-based	One directory, one environment	Fully isolated state files	Explicit, matches repo structure	Configuration duplicated across directories can diverge
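
One way to make the target environment explicit is a thin wrapper that only accepts known directory names. The following is a minimal Python sketch, assuming the environments/dev, environments/staging, and environments/prod layout described above. The function name and the allow-list are illustrative, not part of any tool. It builds the command rather than running it, and relies on the -chdir global option so that each environment's own backend configuration and terraform.tfvars are picked up from its directory.

```python
from pathlib import Path

# Assumption: these are the only environment directories in the repo.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


def build_plan_command(environment: str, repo_root: str = ".", binary: str = "terraform") -> list[str]:
    """Return the plan command for one environment directory.

    Raises ValueError for unknown names, so a typo fails before any cloud call.
    """
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(
            f"unknown environment {environment!r}; expected one of {ALLOWED_ENVIRONMENTS}"
        )
    env_dir = Path(repo_root) / "environments" / environment
    # -chdir is a global option and must come before the subcommand.
    return [binary, f"-chdir={env_dir.as_posix()}", "plan"]
```

Passing binary="tofu" produces the equivalent OpenTofu command. The point of the design is that the environment name appears in the command itself and in CI logs, rather than living in hidden workspace state.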

State access controls. Treat state file read access as equivalent to production database read access. Anyone who can read the state file can read every output, every resource ID, and any secret stored in outputs or resource attributes. In AWS, use S3 bucket policies with per-role access. In GCP, use GCS IAM bindings. Audit state access quarterly.

State file versioning is the safety net. Enable S3 versioning or GCS object versioning on your state backend. When a bad apply corrupts state, you restore to the previous version and re-run. Without versioning, you are manually reconstructing state from the AWS console, which takes hours and introduces new errors.

Module Design: Small, Composable, Testable

The average Terraform module grows to 2,400 lines before teams split it. At that size, terraform plan takes 4 to 7 minutes. terraform apply takes 12 to 20 minutes. A single change to a security group rule forces a plan across 200 unrelated resources. Teams that enforce a 500-line module limit see 3x faster plan and apply cycles, and dramatically lower blast radius per change.
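
A line limit is only useful if something enforces it. Below is a minimal Python sketch of a CI check, assuming each module lives in its own directory under a common root and that only the .tf files directly inside that directory count toward the limit. The function names and the layout are illustrative assumptions.

```python
from pathlib import Path

MAX_MODULE_LINES = 500  # the limit discussed above; adjust per team


def module_line_counts(modules_root: str) -> dict[str, int]:
    """Sum the lines of the .tf files directly inside each module directory."""
    counts: dict[str, int] = {}
    for module_dir in sorted(Path(modules_root).iterdir()):
        if not module_dir.is_dir():
            continue
        total = 0
        for tf_file in module_dir.glob("*.tf"):
            with tf_file.open(encoding="utf-8") as handle:
                total += sum(1 for _ in handle)
        counts[module_dir.name] = total
    return counts


def oversized_modules(modules_root: str, limit: int = MAX_MODULE_LINES) -> list[str]:
    """Return the names of modules whose total line count exceeds the limit."""
    return [name for name, lines in module_line_counts(modules_root).items() if lines > limit]
```

A CI job could call oversized_modules("modules") and fail the build when the list is non-empty. Line count is a crude proxy for blast radius, but it is cheap to measure and hard to argue with.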

The right mental model for module design is two layers:

Resource modules wrap a single provider resource type with sensible defaults and validation. An rds-instance module takes a database size and engine version. It does not also create the VPC, the subnet group, and the parameter group. Each of those is a separate module.

Service modules compose resource modules into a deployable unit. A postgres-service module calls the rds-instance module, the subnet-group module, and the security-group module. It outputs the connection string and the secret ARN. The root module calls service modules.

Design	Plan time	Blast radius per change	Reuse across teams	Test complexity
Monolithic (2,000+ lines)	7-15 min	Entire infrastructure stack	Low, too opinionated	Very high
Service modules (500 lines)	2-4 min	One service	Medium	Medium
Resource modules (150 lines)	Under 1 min	One resource type	High	Low

In Pulumi, the equivalent pattern uses component resources: TypeScript classes that extend pulumi.ComponentResource. A PostgresService class instantiates an RDS instance, a security group, and a subnet group, and exposes the connection string as an output property. The composition is identical to HCL modules in concept, but the implementation uses standard object-oriented patterns. You can instantiate it in a loop, pass it a config object, and test it with Jest without touching AWS.

Testing IaC: From Static Analysis to Integration Tests

IaC without tests is a change management problem. Every apply is a deployment to production infrastructure, and you have no way to know whether your change does what you intended until after it runs. The testing pyramid for IaC has three layers.

Static analysis runs before any cloud API call. Checkov scans Terraform or OpenTofu code, and optionally plan JSON, for security misconfigurations: public S3 buckets, missing encryption, overly permissive IAM policies. It catches an average of 14 high-severity issues per 1,000 lines of IaC in teams without prior security scanning. TFLint catches invalid values, deprecated syntax, and provider-specific rule violations. Both run in under 30 seconds and belong in every pre-commit hook and CI pipeline.

Plan validation runs terraform plan or tofu plan and asserts on the output. You can check that a plan does not destroy resources unexpectedly, that the number of resources being created matches expectations, or that specific attributes have specific values. Open Policy Agent (OPA) with Conftest provides a policy-as-code layer for this: write Rego rules that evaluate plan JSON output and fail the CI pipeline if a rule is violated.
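
The same kind of check can be prototyped without Rego. Here is a minimal Python sketch (a stand-in for an OPA policy, not a replacement for one) that reads the JSON produced by terraform show -json on a saved plan file and lists every resource the plan would delete or replace. It relies on the resource_changes list and the change.actions field of the plan JSON format. The function name is illustrative.

```python
import json


def destroyed_addresses(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    destroyed = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create,
        # so checking for "delete" covers both destroys and replacements.
        if "delete" in actions:
            destroyed.append(change["address"])
    return destroyed
```

In CI, the job would fail when the returned list is non-empty unless the pull request carries an explicit approval for destructive changes. How that approval is recorded is a team decision and is not shown here.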

Integration tests provision real infrastructure in a temporary account or project and validate it end-to-end. Terratest, written in Go, is the most widely used framework for this. It provisions, runs assertions (can we connect to the database? Does the load balancer return 200?), and tears down. The investment is significant: you need a dedicated test account, test execution adds 15 to 45 minutes to CI, and writing Go is required. The return is also significant: mean time to detect infrastructure regressions drops from 4.2 days (found in production) to under 30 minutes (found in CI).

Layer	Tool	Runs when	Language	What it catches
Static analysis	Checkov, tflint	Pre-commit, CI	Any	Security misconfigs, syntax errors
Plan validation	OPA/Conftest	CI	Rego	Unexpected destroys, policy violations
Integration tests	Terratest, Pulumi test	CI (nightly or PR)	Go / any	Runtime failures, connectivity, behavior

Secrets, Drift, and the Two Problems Nobody Talks About

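Secrets in state files are the most consistently underestimated risk in IaC. Terraform and OpenTofu write output values and resource attributes to state in plaintext by default; marking an output sensitive only hides it from CLI output. If you output a database password, an API key, or a private key, anyone with state file read access has it. Pulumi encrypts values that are marked as secrets, and OpenTofu 1.7 and later can encrypt the whole state file, but both protections depend on being configured and used correctly.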

The solution is to never expose secrets as outputs. Instead, write secrets directly to a secrets manager during apply, and have applications retrieve them at runtime. Use SOPS to encrypt sensitive variable files before committing them. Where the service supports it, let Vault or AWS Secrets Manager generate and rotate the secret on the service side, so the value never passes through state.

The anti-pattern to avoid: generating a random password with random_password, outputting it, and passing it into a module. The password is in state as soon as the resource is created, and outputting it adds copies in the output and potentially in CI logs. Rotate all secrets whenever state file access cannot be fully audited.
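
A cheap guard against this anti-pattern is to scan outputs in CI. The following minimal Python sketch assumes the JSON printed by terraform output -json (or tofu output -json), where each output name maps to an object with value, type, and sensitive fields. The name pattern is an illustrative assumption and far from exhaustive.

```python
import json
import re

# Assumption: name fragments that suggest a secret. Extend for your own naming scheme.
SECRET_NAME_PATTERN = re.compile(r"password|secret|token|private_key|api_key", re.IGNORECASE)


def risky_outputs(outputs_json: str) -> list[str]:
    """Return output names that are marked sensitive or look like secrets by name."""
    outputs = json.loads(outputs_json)
    flagged = []
    for name, meta in outputs.items():
        if meta.get("sensitive") or SECRET_NAME_PATTERN.search(name):
            flagged.append(name)
    return sorted(flagged)
```

A non-empty result does not prove a leak, and an empty one does not prove safety, since secrets can also sit in resource attributes. It is a prompt for review, not a verdict.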

Drift is the gap between what your IaC declares and what is actually running in your cloud account. Env0's 2024 State of IaC report found that 67% of teams using IaC experience significant drift. The two primary sources are manual console changes (someone fixes an incident by editing a security group directly) and auto-scaling side effects (an autoscaler changes a group's desired capacity away from the value declared in code).

Drift detection requires scheduled plan runs. Run terraform plan or tofu plan on a schedule, every 4 to 6 hours, without ever applying, and alert when the plan shows a non-empty diff (the -detailed-exitcode flag signals this with exit code 2). A diff means something changed outside of IaC. You then make a decision: import the change into state, revert it, or update the IaC to declare it intentionally.
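
A scheduled drift check can be a few lines around the plan command. This minimal Python sketch assumes the binary is on PATH and that the environment directory has already been initialized. With -detailed-exitcode, plan exits with 0 for an empty diff, 1 for an error, and 2 for a non-empty diff. The function names and status strings are illustrative, and alerting is left out.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map a plan -detailed-exitcode result to a drift status."""
    if code == 0:
        return "in_sync"  # plan succeeded with an empty diff
    if code == 2:
        return "drift"  # plan succeeded with a non-empty diff
    return "error"  # exit code 1, or anything unexpected


def check_drift(env_dir: str, binary: str = "terraform") -> str:
    """Run a non-interactive plan in env_dir and report its drift status."""
    result = subprocess.run(
        [binary, f"-chdir={env_dir}", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Treating "error" as its own status matters: a failed plan is not evidence that the environment is in sync, and a scheduler that only alerts on "drift" would silently miss expired credentials.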

Teams that run scheduled drift detection catch manual console changes within 6 hours instead of discovering them weeks later during the next apply, when the plan unexpectedly wants to destroy a resource someone is depending on.

Choosing Between Terraform, OpenTofu, and Pulumi in 2026

This is not a close call for most teams. The decision depends on three factors: your current footprint, your team's language background, and your commercial constraints.

Scenario	Recommended tool	Reason
Existing large Terraform codebase, no commercial conflicts	Stay on Terraform	Migration cost outweighs benefit
Existing Terraform codebase, vendor lock-in concern	Migrate to OpenTofu	Near-zero migration cost, open governance
Greenfield, team has strong software engineering background	Pulumi	Testability, real languages, component model
Greenfield, team prefers declarative config	OpenTofu	Open-source Terraform with active feature development
Multi-cloud with complex conditional logic	Pulumi	HCL conditionals do not scale to complex branching
Regulated environment requiring open-source stack	OpenTofu	BSL restrictions may conflict with compliance requirements

The switching cost from Terraform to OpenTofu is low now and rising. The feature sets are diverging. If you are going to migrate, doing it before the gap widens is easier than doing it in two years when your code relies on Terraform-specific features.

The switching cost from Terraform or OpenTofu to Pulumi is high: you rewrite all IaC in a new language and framework. The benefit is also real: unit testability, type safety, and general-purpose language constructs for complex infrastructure. For teams building internal developer platforms or infrastructure abstractions used by many other teams, that investment pays off. For teams with stable, low-complexity infrastructure, it probably does not.

The best practice that applies regardless of tool: treat your IaC codebase with the same engineering discipline as your application code. Code review for every change, tests at every layer, linting in every CI pipeline, and secrets management from day one. The tool is secondary. The practice is what keeps production stable.