<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeleel</title>
    <description>The latest articles on DEV Community by Jeleel (@m_yemi_64fd836a19f8dbadb).</description>
    <link>https://dev.to/m_yemi_64fd836a19f8dbadb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826017%2F06f3e9d9-091b-481c-b334-f30802ea0f6f.png</url>
      <title>DEV Community: Jeleel</title>
      <link>https://dev.to/m_yemi_64fd836a19f8dbadb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/m_yemi_64fd836a19f8dbadb"/>
    <language>en</language>
    <item>
      <title>The Future of Infrastructure Is Control Surfaces</title>
      <dc:creator>Jeleel</dc:creator>
      <pubDate>Sat, 18 Apr 2026 19:12:27 +0000</pubDate>
      <link>https://dev.to/m_yemi_64fd836a19f8dbadb/the-future-of-infrastructure-is-control-surfaces-1oh8</link>
      <guid>https://dev.to/m_yemi_64fd836a19f8dbadb/the-future-of-infrastructure-is-control-surfaces-1oh8</guid>
      <description>&lt;p&gt;Most infrastructure tooling exposes configuration primitives. It gives you resources, parameters, and state management. What it rarely gives you is a stable operational interface: a set of actions that map to what an operator actually needs to do, rather than what the underlying system needs to receive.&lt;/p&gt;

&lt;p&gt;This distinction sounds subtle. In practice, it changes how engineering teams reason about their infrastructure, how they onboard and develop operators, and how they recover when things break under pressure.&lt;/p&gt;




&lt;h2&gt;What operators actually need&lt;/h2&gt;

&lt;p&gt;When something goes wrong at 2am, the engineer on call is not thinking about resource graphs or provider schemas. They are thinking about operational actions: restart this service, promote this replica, reroute traffic away from this node, validate that the environment is still healthy.&lt;/p&gt;

&lt;p&gt;None of those actions map cleanly onto raw tool invocations. The commands do something, but the operator has to carry the translation in their head. Which playbook applies here? Which variable file? Which inventory? Which order? Against which target? Under which constraints?&lt;/p&gt;

&lt;p&gt;That translation is where errors happen under pressure. And it's not the operator's fault. It is the natural consequence of exposing configuration primitives as the primary operational interface. The tools were not designed to be the interface. They were designed to be the implementation.&lt;/p&gt;




&lt;h2&gt;The control surface model&lt;/h2&gt;

&lt;p&gt;A control surface is a stable layer that sits between the operator and the underlying tooling. It exposes actions, not configuration. The operator declares intent ("deploy this environment", "test this failover path", "validate this service") and the control surface handles the rest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What a raw tool invocation looks like&lt;/span&gt;
ansible-playbook &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-i&lt;/span&gt; inventories/lab/hosts.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"target_env=lab db_replica_mode=promote"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  playbooks/db-promote-replica.yml

&lt;span class="c"&gt;# What a control surface invocation looks like&lt;/span&gt;
hyops run db-promote-replica &lt;span class="nt"&gt;--env&lt;/span&gt; lab &lt;span class="nt"&gt;--profile&lt;/span&gt; production-safe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The underlying playbook runs in both cases. But in the second case, the control surface validates pre-conditions, enforces profile constraints, captures the run record, and surfaces a normalised result. The operator works at the level of intent. The platform handles the mechanical translation.&lt;/p&gt;
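
&lt;p&gt;To make the mechanics concrete, here is a minimal Python sketch of what a layer like this might do around the invocation above. The profile table, rule names, and return shape are invented for illustration; this is not the actual &lt;code&gt;hyops&lt;/code&gt; implementation:&lt;/p&gt;

```python
import time

# Hypothetical sketch of a control surface around a raw tool invocation:
# check profile constraints first, then translate intent into the real
# command, and always return a structured run record.
PROFILES = {
    # A "production-safe" profile might forbid destructive actions outright.
    "production-safe": {"allow_destructive": False},
    "lab-open": {"allow_destructive": True},
}

def run_action(action, env, profile, destructive=False):
    constraints = PROFILES[profile]
    # Pre-condition checks happen before the underlying tool is invoked.
    if destructive and not constraints["allow_destructive"]:
        return {"action": action, "env": env, "status": "refused",
                "reason": "destructive action not allowed by profile"}
    # The mechanical translation lives here, not in the operator's head.
    cmd = ["ansible-playbook",
           "-i", "inventories/{}/hosts.yml".format(env),
           "playbooks/{}.yml".format(action)]
    # subprocess.run(cmd, check=True) would execute it; omitted in this sketch.
    return {"action": action, "env": env, "profile": profile,
            "command": cmd, "started": time.time(), "status": "ok"}
```

&lt;p&gt;The structural point: refusal, translation, and the run record live in one place, so every operator passes through the same checks.&lt;/p&gt;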

&lt;p&gt;This is not a new idea. It's how well-designed systems have always worked. A network control plane doesn't ask the operator to write forwarding table entries by hand. The operator sets policy; the system derives the state. The control surface is the stable interface. The implementation beneath it is an engineering concern, not an operational one.&lt;/p&gt;

&lt;p&gt;Infrastructure has been slower to adopt this model, partly because the tooling has historically been capable enough at the command level that building a stable layer above it felt like overhead. That calculus is changing as environments grow more complex and team composition becomes more dynamic.&lt;/p&gt;




&lt;h2&gt;Why this matters more as systems grow&lt;/h2&gt;

&lt;p&gt;Single-tool, single-operator environments can survive without a control surface. The operator is the control surface. They carry the context in their head and translate intent into tool invocations directly.&lt;/p&gt;

&lt;p&gt;That model doesn't scale. It doesn't survive team rotation. It doesn't survive incidents, where cognitive load is already high and the last thing you need is to reconstruct the correct sequence of command-line flags from memory while the Slack thread fills up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more complex the environment, the more valuable the stable interface becomes.&lt;/strong&gt; I've noticed that teams with stable operational interfaces respond to incidents differently, not because their engineers are more skilled, but because they spend less of their incident time reconstructing what to run and more of it on what is actually wrong. A hybrid environment with on-prem compute, cloud landing zones, a Kubernetes platform, database HA, and GitOps delivery chains cannot be operated safely through raw tool invocations alone. The surface area is too large. The coordination requirements between tools are too high.&lt;/p&gt;

&lt;p&gt;The typical response is increasingly complex wrapper scripts. They work until they don't, and they tend to break in the specific failure modes that weren't anticipated when the script was written. A control surface is a different approach. It is designed from the start to be the operational interface, not bolted on later as a workaround.&lt;/p&gt;




&lt;h2&gt;What a stable interface actually provides&lt;/h2&gt;

&lt;p&gt;When the operational interface is stable, a few things become possible that are difficult or impossible with raw tool invocations.&lt;/p&gt;

&lt;p&gt;First, testing. If the interface is stable, you can test it. Run an operation in a lab environment, observe the result, validate the run record, and be confident that the same operation will behave the same way in production, because it goes through the same interface with the same constraints applied. The tooling underneath may differ, but the contract doesn't.&lt;/p&gt;

&lt;p&gt;Second, training. New engineers can learn the operational model without needing to understand every tool in the stack before they can do anything useful. They learn what actions are available, what pre-conditions they expect, and what the output looks like. The underlying tooling becomes implementation detail rather than prerequisite knowledge. This matters particularly for teams that rotate engineers or bring in people for specific delivery windows.&lt;/p&gt;

&lt;p&gt;Third, governance. A stable interface is a natural enforcement point. Access controls, environment constraints, approval gates, audit records: these belong at the interface layer, not scattered across individual tool invocations that different engineers may run differently.&lt;/p&gt;
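
&lt;p&gt;Because every operation passes through one layer, the enforcement logic can be small and uniform. A sketch, with role names and rules invented for illustration:&lt;/p&gt;

```python
# The interface layer as a single enforcement point: access control,
# environment constraints, and audit records applied in one place.
AUDIT_LOG = []

RULES = {
    "prod": {"roles": {"sre", "lead"}, "needs_approval": True},
    "lab":  {"roles": {"sre", "lead", "contractor"}, "needs_approval": False},
}

def authorise(operator, role, env, approved):
    rule = RULES[env]
    allowed = role in rule["roles"] and (approved or not rule["needs_approval"])
    # Every decision is recorded, including refusals.
    AUDIT_LOG.append({"operator": operator, "env": env, "allowed": allowed})
    return allowed
```

&lt;p&gt;Refusals are recorded alongside successes, which is what makes the log an audit trail rather than a success diary.&lt;/p&gt;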




&lt;h2&gt;Where things are going&lt;/h2&gt;

&lt;p&gt;HybridOps is built around this idea: expose operational actions, not configuration primitives, as the primary interface. Modules declare intent. Drivers execute against real targets. The operator works through the control surface; the tooling is an implementation concern.&lt;/p&gt;

&lt;p&gt;Whether that specific model turns out to be the right one is an open question. But the underlying direction seems clear: as infrastructure environments grow more complex, the teams that operate them well will be the ones who invested in stable operational interfaces, not the ones who accumulated the most raw configuration management expertise.&lt;/p&gt;

&lt;p&gt;Tool fluency will always matter. But it's not enough on its own. The interface above the tools is where operational reliability gets built and sustained over time.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
    </item>
    <item>
      <title>Terraform Is Not a Platform — And That's Okay</title>
      <dc:creator>Jeleel</dc:creator>
      <pubDate>Sat, 04 Apr 2026 08:41:18 +0000</pubDate>
      <link>https://dev.to/m_yemi_64fd836a19f8dbadb/terraform-is-not-a-platform-and-thats-okay-4inf</link>
      <guid>https://dev.to/m_yemi_64fd836a19f8dbadb/terraform-is-not-a-platform-and-thats-okay-4inf</guid>
      <description>&lt;p&gt;Terraform changed how engineers think about infrastructure. Before it, provisioning was a combination of shell scripts, console clicks, and tribal knowledge held by whoever set the environment up. After it, you could express infrastructure as code, run it repeatedly, and get consistent results across environments. That is a genuine and lasting shift.&lt;/p&gt;

&lt;p&gt;But a pattern keeps appearing across teams I've spoken to and environments I've worked in: Terraform ends up carrying responsibilities it was never built to handle. It becomes the provisioning engine, the operational interface, the validation layer, and the lifecycle manager — all at once. And at some point the cracks appear.&lt;/p&gt;




&lt;h2&gt;The tool is not the problem&lt;/h2&gt;

&lt;p&gt;Terraform is an execution engine. It takes a declared desired state, computes a diff against the current world, and applies the delta. That is what it was designed to do, and it does it well. The state file is reliable. The plan output is readable. The provider ecosystem is mature.&lt;/p&gt;

&lt;p&gt;The problem is what teams build around it — or more often, what they don't build. When &lt;code&gt;terraform apply&lt;/code&gt; becomes the answer to every operational question, the tool becomes a platform by default. And a tool pressed into a role it wasn't designed for will eventually show you exactly where its boundaries are.&lt;/p&gt;




&lt;h2&gt;What a platform actually does&lt;/h2&gt;

&lt;p&gt;A platform provides a stable operational interface above the execution layer. The difference is easier to see with a concrete comparison.&lt;/p&gt;

&lt;p&gt;Running Terraform directly, your operational vocabulary looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform init
terraform plan &lt;span class="nt"&gt;-var-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lab.tfvars &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform apply tfplan
terraform destroy &lt;span class="nt"&gt;-target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;module.compute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those are execution primitives. They are correct and useful. But they carry no opinion about &lt;em&gt;when&lt;/em&gt; it is safe to apply, &lt;em&gt;who&lt;/em&gt; verified the plan, what the environment should look like after the operation completes, or what to do if it doesn't.&lt;/p&gt;

&lt;p&gt;In a platform model, the underlying tools still run — Terraform still applies, Ansible still configures — but the operator works at a different level. They run a module that declares intent. The platform drives the execution sequence, enforces pre-conditions, captures a run record, and surfaces the result in a normalised format. The operator does not need to wire the tools together manually on every run.&lt;/p&gt;
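
&lt;p&gt;One way to picture that separation: the module is pure data declaring intent, and the platform owns the loop that wires drivers together. Module, step, and driver names here are invented for illustration:&lt;/p&gt;

```python
# The module declares intent, not execution detail.
MODULE = {
    "name": "provision-web-tier",
    "steps": [
        {"driver": "terraform", "action": "apply"},
        {"driver": "ansible",   "action": "configure"},
        {"driver": "http",      "action": "validate"},
    ],
}

def execute(module, drivers):
    # The platform, not the operator, owns ordering and wiring between tools.
    results = []
    for step in module["steps"]:
        results.append(drivers[step["driver"]](step["action"]))
    return results
```

&lt;p&gt;A driver here is any callable that executes one action against a real target; swapping a driver changes the implementation without changing the module.&lt;/p&gt;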

&lt;p&gt;This is not abstraction for its own sake. It means a less experienced engineer can safely execute an operation that a senior engineer designed. It means that operation can be tested in isolation. It means you have something to point to when you need to show what ran and what the result was.&lt;/p&gt;




&lt;h2&gt;Where the gap appears&lt;/h2&gt;

&lt;p&gt;The gap between "we use Terraform" and "we have a platform" shows up in predictable places.&lt;/p&gt;

&lt;p&gt;Operations that require coordinating multiple tools — provisioning with Terraform, then configuring with Ansible, then validating a service endpoint — have no clean home in a raw Terraform workflow. Someone writes a wrapper script. I've seen this pattern in organisations of every size, and the script almost always grows faster than anyone expected. Six months later nobody fully owns it, and it breaks in exactly the failure mode that wasn't anticipated when it was written.&lt;/p&gt;

&lt;p&gt;Governance has no enforcement point. Terraform has no opinion about whether an apply is appropriate at this moment, in this environment, for this operator. That judgment lives entirely in the person running the command. A platform can encode those constraints. A naked tool invocation cannot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence is sparse.&lt;/strong&gt; &lt;code&gt;terraform apply&lt;/code&gt; produces a plan and updates the state file. It does not produce a structured record of intent, pre-condition state, post-apply validation results, or a redacted log tied to the run. Those things matter when something goes wrong, or when someone external needs to understand what the platform did and why.&lt;/p&gt;
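
&lt;p&gt;A sketch of what such a record might contain (the fields mirror the list above; the schema is invented for illustration, not a real HybridOps format):&lt;/p&gt;

```python
from dataclasses import dataclass

# Illustrative shape for a structured run record: intent, pre-condition
# state, post-apply validation, and a redacted log tied to the run.
@dataclass
class RunRecord:
    intent: str               # what the operator asked for
    precondition_state: dict  # environment checks taken before the apply
    validation_results: dict  # post-apply checks
    log_excerpt: str = ""     # log fragment tied to this run

    def redact(self, secrets):
        # Strip known secret values before the record is stored or shared.
        for s in secrets:
            self.log_excerpt = self.log_excerpt.replace(s, "[REDACTED]")

record = RunRecord(
    intent="apply module.compute in lab",
    precondition_state={"plan_reviewed": True},
    validation_results={"endpoint_healthy": True},
    log_excerpt="connecting with token hunter2 ... done",
)
record.redact(["hunter2"])
```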




&lt;h2&gt;The right relationship&lt;/h2&gt;

&lt;p&gt;Terraform is a strong foundation precisely because it stays in its lane. It is a reliable, well-understood execution engine with a large provider ecosystem and predictable semantics. Those are genuine strengths worth preserving.&lt;/p&gt;

&lt;p&gt;The question is not whether to use Terraform. It is what operational layer to build on top of it. The stable interface, the governance model, the verification path — those belong above the tool, not inside it.&lt;/p&gt;

&lt;p&gt;HybridOps is built around that separation: modules declare intent, drivers invoke Terraform (and other tools) against real targets, and every run produces a structured record. Terraform remains exactly what it is — a capable execution engine — and the platform layer handles what Terraform was never meant to handle.&lt;/p&gt;




&lt;h2&gt;What this actually means&lt;/h2&gt;

&lt;p&gt;If you're running infrastructure with Terraform today, the useful question is not "should we switch tools?" It is: what does your operational layer look like? Who controls when things run? What do you capture when they do? What happens when they don't?&lt;/p&gt;

&lt;p&gt;For some teams, the answer is a thin wrapper that standardises how plans are reviewed before they are applied. For others, it means a more complete separation between what the platform declares and how drivers execute it. The right shape depends on the environment, the team, and the failure modes you have actually hit.&lt;/p&gt;

&lt;p&gt;Even a thorough Terraform estate does not give you a platform. It gives you raw capability — valuable capability — that still needs a coherent operational model around it.&lt;/p&gt;

&lt;p&gt;Terraform was never trying to be a platform. That is not a limitation. It is a reason to be precise about what platform engineering actually involves — and to build the right layer for the job.&lt;/p&gt;




&lt;p&gt;Platform overview: &lt;a href="https://hybridops.tech/why" rel="noopener noreferrer"&gt;https://hybridops.tech/why&lt;/a&gt;&lt;br&gt;
Documentation: &lt;a href="https://docs.hybridops.tech" rel="noopener noreferrer"&gt;https://docs.hybridops.tech&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>platformengineering</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
