Bader Eddine El ouerghi

Posted on Jun 23

How We Built a Multi-Cloud Terraform Orchestrator: The TerraX Architecture

#ai #devops #terraform #cloud

Beta is open. If you want to try TerraX before finishing this article: docs.terrax-cloud.com

Why We Built This

Managing infrastructure across AWS, GCP, Azure, and Cloudflare with a small team taught us one thing fast: Terraform itself is not the bottleneck. Everything around it is.

State backends scattered across providers. Credentials living on individual laptops. No shared record of who ran a plan and what changed. And a persistent gap between "infrastructure provisioned" and "things running on that infrastructure configured" — a gap that gets filled by Bash scripts nobody wants to maintain.

We wanted a platform that treated each cloud provider as a first-class object: centralized credentials, isolated state per provider, a sandboxed executor that runs terraform plan and terraform apply with full log streaming, and a workflow engine that could chain Terraform runs with script functions and pass outputs between them. That became TerraX.

This article walks through the key architectural decisions — what we chose, what we nearly chose, and where we got things wrong the first time.

System Overview

TerraX is composed of four main layers:

[ Angular Frontend ]
        |
[ Spring Boot Backend — "teraflare" ]
        |
[ Go Agent — "terrax-agent" ]
        |
[ Terraform / OpenTofu executor ]

Frontend (Angular) handles the UI layer: resource forms, plan/apply diff views, workflow builder, kubeconfig manager, compliance dashboards. It talks exclusively to the backend via REST.

Backend (Spring Boot / Java) is the core: provider credential storage, Terraform state management, RBAC enforcement, workflow orchestration, cost forecasting, compliance scanning. DB migrations run via Flyway.

Agent (Go) is the sandboxed executor. It receives jobs from the backend, bundles workspace files, runs Terraform/OpenTofu, and streams logs back in real time. It runs separately from the backend for isolation — a misbehaving Terraform run can't take down the API server.

Terraform / OpenTofu is the execution engine underneath everything. We default to OpenTofu.

The Go Agent: Why Go, and How the Sandbox Works

The decision to write the agent in Go rather than extending the Java backend was deliberate. Terraform execution is I/O-heavy and long-running, sometimes 10+ minutes for a full apply on a complex provider. We didn't want those long-running executions competing with API request handling on the same JVM.

Go gave us lightweight concurrency for managing multiple concurrent jobs, straightforward process management for spawning tofu as a subprocess, and simple binary deployment into the Kubernetes pod that runs the executor.

When the backend enqueues a job, the agent:

Receives the job payload (provider credentials, workspace ID, operation type)
Bundles the workspace: pulls .tf files, variable files, and state from the backend
Writes them to a temporary working directory
Spawns tofu plan or tofu apply as a subprocess
Streams stdout/stderr back to the backend line by line
Reports the final exit code and cleans up
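
The steps above can be sketched in a few lines. This is an illustration in Python for brevity, not the Go agent itself: run_job, its parameters, and the "terrax-job-" prefix are hypothetical names, and the demo spawns a harmless Python child process where the agent would spawn tofu.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_job(
    command: List[str],
    workspace_files: Dict[str, str],
    on_log_line: Callable[[str], None],
) -> int:
    """Run one job in a throwaway directory and return the process exit code."""
    # TemporaryDirectory deletes the working directory when the block exits,
    # even if the subprocess fails, so nothing persists on the agent.
    with tempfile.TemporaryDirectory(prefix="terrax-job-") as workdir:
        for relative_name, content in workspace_files.items():
            target = Path(workdir) / relative_name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)

        # Merge stderr into stdout so the receiver sees one ordered stream.
        process = subprocess.Popen(
            command,
            cwd=workdir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        )
        assert process.stdout is not None
        for line in process.stdout:
            on_log_line(line.rstrip("\n"))  # forward each line as it arrives
        return process.wait()


if __name__ == "__main__":
    # Stand-in for something like ["tofu", "plan", "-input=false", "-no-color"].
    demo_command = [sys.executable, "-c", "print('planning'); print('done')"]
    exit_code = run_job(demo_command, {"main.tf": "# empty"}, print)
    print("exit code:", exit_code)
```

Passing the log callback in as a parameter keeps the executor independent of the transport used to ship lines back to the backend.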

The key design decision here: state is never stored on the agent. The agent is stateless — it pulls what it needs per job and pushes results back. This means you can run multiple agent instances horizontally and jobs route to whichever is available.

We chose OpenTofu as the default because of its licensing clarity. The executor binary is configurable per job if you need Terraform itself for a specific provider version.

Cross-Cloud Kubeconfig Generation Without the Cloud CLIs

One of the most concrete problems we solved is kubeconfig generation across EKS, GKE, and AKS without depending on aws, gcloud, or az being installed anywhere.

Each provider has a completely different auth model:

EKS uses an exec-plugin model. The kubeconfig calls aws eks get-token at authentication time, which means the AWS CLI must be installed wherever the kubeconfig is used. We generate the kubeconfig server-side by calling eks:DescribeCluster via the AWS SDK to get the cluster endpoint and CA data, then embed the exec-plugin block pointing to the AWS SDK's token endpoint. Auth still flows through the SDK — we're not storing long-lived tokens.

GKE is simpler in some ways: we call the GCP Container REST API with a service account token, pull the cluster endpoint and CA certificate, and embed a short-lived Bearer token directly in the kubeconfig. The downside is that token expiry means regenerating the kubeconfig periodically, which we handle transparently.
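
To make the token-based variant concrete, here is a minimal sketch of assembling such a kubeconfig once the endpoint, CA certificate, and token are in hand. The function name and the "-user" naming scheme are assumptions for illustration; the field names are the standard kubeconfig ones. The sketch serializes to JSON, which is valid YAML, to avoid a third-party YAML dependency.

```python
import base64
import json
from typing import Any, Dict


def build_token_kubeconfig(
    cluster_name: str, endpoint: str, ca_pem: bytes, bearer_token: str
) -> Dict[str, Any]:
    """Return a kubeconfig document that authenticates with a bearer token."""
    user_name = f"{cluster_name}-user"  # hypothetical naming scheme
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {
                "name": cluster_name,
                "cluster": {
                    "server": endpoint,
                    # kubeconfig expects the CA bundle base64-encoded. If the
                    # cloud API already returns base64, pass it through as-is.
                    "certificate-authority-data": base64.b64encode(ca_pem).decode("ascii"),
                },
            }
        ],
        "users": [{"name": user_name, "user": {"token": bearer_token}}],
        "contexts": [
            {"name": cluster_name, "context": {"cluster": cluster_name, "user": user_name}}
        ],
        "current-context": cluster_name,
    }


if __name__ == "__main__":
    config = build_token_kubeconfig(
        "demo-gke", "https://203.0.113.5", b"-----BEGIN CERTIFICATE-----...", "short-lived-token"
    )
    print(json.dumps(config, indent=2))
```

Because the token sits in the file itself, whatever generates the kubeconfig also has to own its refresh, which matches the periodic regeneration described above.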

AKS provides the cleanest result: listClusterAdminCredential from the Azure ARM API returns a ready-to-use kubeconfig with a cert-based auth block. No token rotation needed, though the cert-based approach means these are longer-lived credentials that need to be stored carefully.

All three are stored namespaced per cluster in the backend database, accessible based on the user's role, and selectable as the target context for any downstream operation — ArgoCD installs, Helm runs, raw kubectl operations via the agent.

The Workflow Engine: Passing Outputs Between Steps

The workflow engine is what ties infrastructure provisioning to operational automation. A workflow is a sequence of steps with conditional logic — each step is either a Terraform run or a Script Function (Python, Bash, or Docker container).

The interesting part is output passing. When a Terraform apply completes, its outputs (e.g., a cluster endpoint, a load balancer IP, a database hostname) need to be available to the next step without manual copy-paste. We built a dynamic variable resolution system using reference strings:

ref:var:VARIABLE_NAME — resolves a variable defined at the provider or workflow level
ref:provider:ATTRIBUTE — resolves an attribute from the connected provider
ref:function:OUTPUT_KEY — resolves an output captured from a previous Script Function step

At runtime, the backend resolves these references before passing the step payload to the agent. This means you can provision an EKS cluster in step 1, capture its endpoint as ref:var:CLUSTER_ENDPOINT, and pass it to a Helm install in step 2 — without hardcoding anything.
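
A minimal sketch of that resolution pass, written in Python for illustration. The function name, the shape of the scopes dictionary, and the choice to resolve only values that are entirely a reference (no interpolation inside longer strings) are assumptions, not a description of the TerraX backend.

```python
from typing import Any, Dict

REF_PREFIX = "ref:"


def resolve_value(value: Any, scopes: Dict[str, Dict[str, Any]]) -> Any:
    """Recursively replace 'ref:<scope>:<key>' strings with looked-up values."""
    if isinstance(value, dict):
        return {k: resolve_value(v, scopes) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_value(v, scopes) for v in value]
    if not (isinstance(value, str) and value.startswith(REF_PREFIX)):
        return value  # numbers, booleans and plain strings pass through

    parts = value.split(":", 2)  # keep any further colons inside the key
    if len(parts) != 3 or not parts[2]:
        raise ValueError(f"malformed reference: {value!r}")
    _, scope, key = parts
    if scope not in scopes:
        raise ValueError(f"unknown reference scope {scope!r} in {value!r}")
    if key not in scopes[scope]:
        raise KeyError(f"unresolved reference: {value}")
    return scopes[scope][key]


if __name__ == "__main__":
    scopes = {
        "var": {"CLUSTER_ENDPOINT": "https://eks.example.internal"},
        "provider": {"region": "eu-west-1"},
        "function": {"DB_HOST": "db.internal"},
    }
    step_payload = {
        "helm": {"server": "ref:var:CLUSTER_ENDPOINT", "set": ["ref:function:DB_HOST"]},
        "region": "ref:provider:region",
    }
    print(resolve_value(step_payload, scopes))
```

Failing loudly on an unknown scope or key means a typo in a reference stops the workflow before the step payload is sent anywhere.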

Script Functions run in isolated Docker containers or as Kubernetes jobs, with secrets injected as environment variables. Output is captured from stdout as structured key-value pairs and made available to subsequent steps via ref:function:.
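
The exact output format is not specified here, so the sketch below assumes the simplest plausible convention: one KEY=VALUE pair per stdout line, with everything else treated as ordinary log output. The function name and that convention are both assumptions.

```python
from typing import Dict


def parse_step_outputs(stdout: str) -> Dict[str, str]:
    """Collect KEY=VALUE lines from a step's stdout; later keys overwrite earlier ones."""
    outputs: Dict[str, str] = {}
    for raw_line in stdout.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank lines, comments and plain log lines are ignored
        # partition splits on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        key = key.strip()
        if key.isidentifier():  # skips prose such as "retry count = 3"
            outputs[key] = value.strip()
    return outputs


if __name__ == "__main__":
    sample = "connecting...\nDB_HOST=db.internal\nDB_URL=postgres://db.internal/app?sslmode=require\n"
    print(parse_step_outputs(sample))
```

A stricter contract, such as a single JSON document on the last line, would avoid accidentally capturing log lines that happen to look like pairs.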

State Isolation Per Provider

One of the early architectural decisions that saved us the most headaches: one state backend per provider, not one global state.

The alternative — a single shared state backend with workspaces — makes cross-provider references tempting and creates blast radius problems. A corrupted state file for your AWS provider shouldn't touch your GCP resources.

In practice, each provider object in TerraX has its own isolated state stored in the backend. When the agent runs a plan or apply, it pulls the state for that specific provider, executes against it, and pushes the updated state back. Providers never share state files.
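
The pull, execute, push cycle can be illustrated with an in-memory stand-in for the backend. Everything here is hypothetical: the class name, the dictionary layout, and in particular the stale-write check, which is an assumption modelled on the serial counter that Terraform state files carry.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict


def _empty_state() -> dict:
    return {"serial": 0, "resources": []}


@dataclass
class ProviderStateStore:
    """Keeps exactly one state document per provider ID; nothing is shared."""

    _states: Dict[str, dict] = field(default_factory=dict)

    def pull(self, provider_id: str) -> dict:
        # Return a deep copy so a job can never mutate the stored state in place.
        return copy.deepcopy(self._states.get(provider_id, _empty_state()))

    def push(self, provider_id: str, new_state: dict) -> None:
        current_serial = self._states.get(provider_id, _empty_state())["serial"]
        if new_state["serial"] <= current_serial:
            raise ValueError(f"stale state for {provider_id}: serial {new_state['serial']}")
        self._states[provider_id] = copy.deepcopy(new_state)


if __name__ == "__main__":
    store = ProviderStateStore()
    state = store.pull("aws-prod")
    state["resources"].append("aws_vpc.main")  # pretend an apply created this
    state["serial"] += 1
    store.push("aws-prod", state)
    print(store.pull("aws-prod"))
    print(store.pull("gcp-prod"))  # untouched by the AWS job
```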

The tradeoff: cross-provider data references have to go through the workflow engine's variable system rather than Terraform's native mechanisms for reading another state, such as the terraform_remote_state data source. For most use cases this is fine. For deeply coupled multi-cloud architectures it adds some friction, and that is something we're still working through.

Compliance Scanning: OPA + Checkov

Every Terraform configuration that passes through TerraX can be scored against compliance frameworks before apply. The scanning layer runs two tools:

Checkov handles static analysis of the .tf files — it understands Terraform's resource schema and checks for common misconfigurations (public S3 buckets, unencrypted RDS, overly permissive security groups, etc.).

OPA (Open Policy Agent) handles custom policy enforcement via Rego policies. This is where you encode organization-specific rules that Checkov doesn't cover — things like "all GCP resources must have a cost-center label" or "no EKS node groups with instance types larger than m5.xlarge without approval."
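
In TerraX such a rule would be written in Rego. To show the shape of the check without introducing a second language, here is the cost-center rule expressed in Python against the JSON plan representation (the output of "show -json" on a saved plan). The function name is hypothetical, and treating every google_ resource as label-capable is a simplifying assumption; a real policy would list the resource types that actually support labels.

```python
from typing import Any, Dict, List


def missing_label_violations(plan: Dict[str, Any], label: str = "cost-center") -> List[str]:
    """Return addresses of planned Google resources that lack the given label."""
    violations: List[str] = []
    for change in plan.get("resource_changes", []):
        if not change.get("type", "").startswith("google_"):
            continue  # the rule only targets GCP resources
        after = (change.get("change") or {}).get("after")
        if after is None:
            continue  # resource is being destroyed, nothing to label
        labels = after.get("labels") or {}
        if label not in labels:
            violations.append(change["address"])
    return violations


if __name__ == "__main__":
    plan = {
        "resource_changes": [
            {
                "address": "google_storage_bucket.logs",
                "type": "google_storage_bucket",
                "change": {"after": {"labels": {"cost-center": "platform"}}},
            },
            {
                "address": "google_compute_instance.web",
                "type": "google_compute_instance",
                "change": {"after": {"labels": None}},
            },
        ]
    }
    print(missing_label_violations(plan))
```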

Results are scored against PCI DSS, CIS Benchmarks, NIST CSF, and SOC 2 controls, aggregated per provider, and surfaced as a compliance score in the dashboard. A one-click "fix preview" shows the Terraform change that would resolve a finding.

The Browser Extension

The Chrome extension deserves a brief mention because it solved a real problem: existing Cloudflare resources.

When you have 200 DNS records, 30 WAF rules, and 15 Page Rules already configured in the Cloudflare dashboard, bringing them under Terraform management is painful. You need to write HCL for each one and run terraform import with the correct Cloudflare resource ID.

The extension runs on the Cloudflare dashboard pages, detects resource configurations as you browse, and sends them to TerraX where they're staged for import. It effectively automates the discovery + HCL generation step for resources you're already looking at in the UI.
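
As a rough illustration of the HCL generation step, the sketch below turns one DNS record, shaped like a Cloudflare API response, into a resource block plus the matching import command. It is not the extension's code. The function names are hypothetical, and the resource type (cloudflare_record), the content attribute, and the zone_id/record_id import ID are assumptions that depend on the Cloudflare provider version, so check them against the version you pin.

```python
import json
import re
from typing import Dict, Tuple


def to_resource_name(record_name: str, record_type: str) -> str:
    """Derive a valid Terraform resource label from a record name and type."""
    slug = re.sub(r"[^0-9a-z]+", "_", f"{record_name}_{record_type}".lower()).strip("_")
    # Labels may not start with a digit, so prefix when needed.
    return f"r_{slug}" if not slug or slug[0].isdigit() else slug


def dns_record_to_hcl(record: Dict[str, object], zone_id: str) -> Tuple[str, str]:
    """Return (hcl_block, import_command) for a single DNS record."""
    name = to_resource_name(str(record["name"]), str(record["type"]))
    # json.dumps gives quoted strings, bare numbers and lowercase booleans,
    # which is close enough to HCL literals for these simple values.
    lines = [
        f'resource "cloudflare_record" "{name}" {{',
        f"  zone_id = {json.dumps(zone_id)}",
        f"  name    = {json.dumps(record['name'])}",
        f"  type    = {json.dumps(record['type'])}",
        f"  content = {json.dumps(record['content'])}",
        f"  ttl     = {json.dumps(record['ttl'])}",
        f"  proxied = {json.dumps(record['proxied'])}",
        "}",
    ]
    import_command = f"terraform import cloudflare_record.{name} {zone_id}/{record['id']}"
    return "\n".join(lines), import_command


if __name__ == "__main__":
    record = {"id": "rec123", "name": "www.example.com", "type": "A",
              "content": "203.0.113.10", "ttl": 300, "proxied": True}
    hcl, command = dns_record_to_hcl(record, "zone456")
    print(hcl)
    print(command)
```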

What We Got Wrong

Abstracting too much from Terraform's output. In early versions we summarized the plan output into a "changes" view. Users hated it. They wanted to see the raw plan, exactly as Terraform prints it. We reverted to showing the full output with syntax highlighting.

Assuming exit codes were reliable. Terraform's exit code behavior across versions and providers is inconsistent enough that we had to parse stdout/stderr directly to determine plan/apply success. Lesson: don't trust exit codes alone for infrastructure tools.
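
One way to combine both signals is to treat the exit code as a fallback and the human-readable summary as the primary evidence. The sketch below is an illustration, not TerraX's parser: the names are hypothetical, and the matched strings are the summary lines recent Terraform and OpenTofu releases print with -no-color, which may change between versions.

```python
import re
from dataclasses import dataclass

# "Plan: 2 to add, 1 to change, 0 to destroy." with an optional import count.
_PLAN_RE = re.compile(
    r"Plan: (?:\d+ to import, )?(\d+) to add, (\d+) to change, (\d+) to destroy\."
)
# Diagnostics start a line with "Error: ", possibly after box-drawing characters.
_ERROR_RE = re.compile(r"^\W*Error: ", re.MULTILINE)


@dataclass(frozen=True)
class PlanOutcome:
    status: str  # "failed", "changes", "no_changes" or "unknown"
    to_add: int = 0
    to_change: int = 0
    to_destroy: int = 0


def classify_plan(exit_code: int, output: str) -> PlanOutcome:
    """Classify a plan run from its combined stdout/stderr and exit code."""
    if _ERROR_RE.search(output):
        return PlanOutcome("failed")  # an error diagnostic wins over everything
    match = _PLAN_RE.search(output)
    if match:
        add, change, destroy = (int(group) for group in match.groups())
        return PlanOutcome("changes", add, change, destroy)
    if "No changes." in output:
        return PlanOutcome("no_changes")
    # No recognizable summary: fall back to the exit code.
    return PlanOutcome("failed" if exit_code != 0 else "unknown")


if __name__ == "__main__":
    print(classify_plan(0, "...\nPlan: 2 to add, 1 to change, 0 to destroy.\n"))
    print(classify_plan(0, "Error: Invalid provider configuration\n"))
```

Keeping an explicit "unknown" status matters: a zero exit code with no recognizable summary is exactly the case where trusting the exit code alone would report success.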

Underestimating the import UX. The most-used feature after launch wasn't resource creation — it was state import. People have years of existing infra they want under management. We initially treated import as a secondary flow and had to rebuild it properly after the first wave of feedback.

Current Limitations

Drift detection is manual — no scheduled reconciliation against live cloud yet
Azure resource coverage is thinner than AWS and GCP
Generated HCL isn't always idiomatic for complex Terraform module structures
ArgoCD SSO not fully tested end-to-end against all identity providers

Try the Beta

TerraX is in open beta. 14-day free trial, no credit card required.

App: app.terrax-cloud.com
Docs: docs.terrax-cloud.com

If you have feedback — especially critical feedback on the architecture decisions above — I'm in the comments.

DEV Community