Recently, the engineering team at Pydantic announced official Infrastructure as Code providers for Logfire. This is a significant milestone for their ecosystem and a real benefit for their users. But reflecting on the sheer amount of work required to ship these providers surfaces a much larger, structural problem in our industry.
If you are reading this on dev.to, there is a good chance you have written declarative configuration files to spin up servers, configure databases, or set up routing rules. Modern enterprise environments are built on complex integrations: organizations no longer rely on a single software vendor, but instead string together dozens of specialized tools, platforms, and services. In this landscape, Infrastructure as Code has become the preferred way to manage the resource types that software vendors expose, and more broadly, the default way to manage any platform backed by a REST API.
Whether you are configuring cloud computing resources, setting up observability pipelines, or defining access control policies, declarative infrastructure is the gold standard for Platform Engineering and Site Reliability Engineering. But underneath this highly streamlined user experience lies a massive, fragmented, and redundant engineering burden that most developers never actually see.
The Friday Afternoon Deployment Nightmare
Before we dive into the architecture, picture a familiar scenario. It is Friday afternoon, and you are tasked with a minor update to a production environment: adding a simple billing tag to an existing cloud database. You update your configuration file, run your planning command, glance quickly at the output, and type "yes" to apply the changes.
Suddenly, alarms start going off. The database was not just updated. It was completely destroyed and recreated. Your application is now experiencing catastrophic downtime because the newly created database is completely empty.
Why did this happen?
The configuration tool calculated the difference between your desired state and the current state, then relied on hardcoded logic written by an integration developer to decide whether the change could be applied in place. Because the logic in that specific provider version was slightly outdated, or contained a minor bug in its tag handling, it mistakenly triggered a full destruction of the resource. This scenario is all too common in DevOps, and it highlights exactly why the current model is broken.
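A minimal sketch of that planning logic makes the failure mode concrete. Every name here is hypothetical; the point is that the force-new rules are hardcoded by the provider author, so a single stale entry turns a harmless tag change into a destructive recreate:

```python
# A provider's plan step, reduced to its essence. The FORCE_NEW set is
# hardcoded by the provider author; every field name is hypothetical.
FORCE_NEW = {"storage_engine", "region", "tags"}
# "tags" is the bug: the platform can update tags in place, but this
# provider version does not know that.

def plan(current: dict, desired: dict) -> str:
    """Return 'no-op', 'update', or 'destroy-and-recreate'."""
    changed = {k for k in desired if current.get(k) != desired.get(k)}
    if not changed:
        return "no-op"
    if changed & FORCE_NEW:
        return "destroy-and-recreate"
    return "update"

current = {"name": "prod-db", "storage_engine": "innodb", "tags": {}}
desired = {"name": "prod-db", "storage_engine": "innodb",
           "tags": {"billing": "team-a"}}
print(plan(current, desired))  # the tag change triggers a full recreate
```

The tool faithfully executes the rules it was given; the rules themselves are simply out of date, and nothing in the workflow can detect that before the apply step runs.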
The Structural Problem in the Infrastructure Ecosystem
There is a deep structural problem in how the modern infrastructure ecosystem operates. Many Infrastructure as Code vendors build their providers in close collaboration with large, well-established cloud platforms. If you are an enterprise cloud giant, you have entire teams dedicated to maintaining official partnerships with infrastructure tooling vendors. This system works exceptionally well for the big players.
However, this dynamic creates a severe disadvantage for technology startups and smaller platforms. Startups inherently lack the massive market visibility of the tech giants, yet they are desperately trying to establish themselves within strict enterprise environments.
For a new platform to truly make a mark in the enterprise space, it often needs to build and maintain its own infrastructure providers. Furthermore, the modern ecosystem is highly fragmented. A vendor cannot simply build one integration. They must build separate providers for different tools, languages, and orchestration frameworks to satisfy diverse customer demands.
This requirement is not trivial. It demands a significant and continuous engineering effort. Startups are forced to divert their most valuable resource, which is their engineering time, away from their core product innovation just to build and maintain integration wrappers. They must keep up with evolving upstream APIs, handle ongoing maintenance, fix integration bugs, and manage multiple open source repositories.
The Open Source Maintainer Burnout Crisis
This fragmentation also leads to a severe human cost. Managing these integrations frequently falls onto the shoulders of open source maintainers or small community teams. Whenever a platform releases a new feature, adds a new configuration option, or changes how an endpoint behaves, the corresponding integration providers immediately become outdated.
Maintainers are then bombarded with bug reports and feature requests. They must constantly review pull requests, write custom mapping code, and release new binaries across multiple registries. The sheer fatigue of maintaining state management logic for an external platform is a massive driver of open source burnout. The community is constantly playing a game of catch-up with the platform backends, and it is a game that is impossible to win under the current architectural paradigm.
The Anatomy of a Provider and The Two Critical Concerns
To understand why building these integrations is so painfully difficult, we must look at how declarative infrastructure actually works under the hood. Every single provider must meticulously implement two critical concerns to function correctly.
First, the provider must handle the Validation of the desired state. When an engineer writes a configuration file, the provider must verify that the defined properties are acceptable. It must check if strings meet length requirements, if integers fall within allowed ranges, and if mutually exclusive properties are not defined together.
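As a sketch, the first concern might look like the following toy validator for a hypothetical database resource. The field names and limits are invented for illustration, not taken from any real platform:

```python
def validate(config: dict) -> list[str]:
    """Collect validation errors for a hypothetical database resource."""
    errors = []
    # String length constraint
    name = config.get("name", "")
    if not 1 <= len(name) <= 63:
        errors.append("name must be 1-63 characters")
    # Integer range constraint
    storage_gb = config.get("storage_gb", 0)
    if not 10 <= storage_gb <= 65536:
        errors.append("storage_gb must be between 10 and 65536")
    # Mutually exclusive properties
    if "snapshot_id" in config and "source_db" in config:
        errors.append("snapshot_id and source_db are mutually exclusive")
    return errors

print(validate({"name": "prod-db", "storage_gb": 100}))  # no errors
```

Every provider re-implements some version of this checking, even though the authoritative constraints already exist in the platform's own API layer.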
Second, and far more complex, the provider must determine whether a resource will be updated in place or needs to be recreated.
As we saw in the Friday deployment nightmare, predicting this behavior accurately is essential for safe Site Reliability Engineering. Changing the underlying storage engine of a database, for example, almost certainly requires destroying and recreating the resource, while changing a tag usually does not. Getting this determination right is the single most important responsibility of the integration tool.
The Limitations of OpenAPI Specifications
When engineers look for solutions to automate this process, they naturally turn to OpenAPI specifications. This standard has revolutionized how we document and interact with REST APIs.
For the first critical concern, which is the Validation of the desired state, a high quality specification is incredibly powerful. It provides the exact schemas, data types, and constraints needed to ensure a user configuration is valid before it is ever sent over the network.
To some extent, OpenAPI specifications can also attempt to help with determining update versus recreate behavior. You might use specific annotations or custom extensions within the specification to hint at which fields are immutable.
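For instance, a provider generator might scan the specification for a vendor extension marking immutable fields. The extension name `x-immutable` here is an assumption for illustration; OpenAPI only guarantees that arbitrary `x-` prefixed keys are permitted, not what they mean:

```python
# Hypothetical: derive force-new fields from a custom `x-immutable`
# extension in an OpenAPI schema. The extension name is an assumption.
spec_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "storage_engine": {"type": "string", "x-immutable": True},
        "tags": {"type": "object"},
    },
}

def immutable_fields(schema: dict) -> set[str]:
    """Return property names flagged as immutable in the schema."""
    return {
        field
        for field, prop in schema.get("properties", {}).items()
        if prop.get("x-immutable")
    }

print(immutable_fields(spec_schema))  # {'storage_engine'}
```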
But in actual enterprise practice, this is rarely enough. The reality of infrastructure management is far too complex to capture in simple static annotations.
The Duplication Trap and the Black Box
The complex business logic that decides whether a specific resource is updated or recreated lives deep inside the backend of the platform itself. It is a highly dynamic set of rules influenced by database constraints, legacy architectural decisions, and current system states.
Crucially, this intricate logic is almost never fully captured in OpenAPI specifications. Furthermore, it is rarely available in public facing documentation.
Because of this lack of visibility, every single infrastructure provider ends up falling into the duplication trap. The engineers writing the provider must manually re-implement the platform backend logic independently within their own codebase. They have to hardcode rules stating that if property X changes, then trigger a recreation, but if property Y changes, trigger an update.
This approach is fundamentally flawed. It is incredibly error-prone and highly inefficient. When the backend engineering team updates their internal logic to support a new in-place update feature, the infrastructure provider has no automatic way of knowing. The provider is now completely out of sync, forcing resources to be deleted unnecessarily until a developer manually updates the provider codebase, cuts a new release, and forces all customers to upgrade their command line binaries.
Ideally, this critical business logic should not live inside the provider code at all. Duplicating backend state machines into external integration tools violates the don't-repeat-yourself principle at the scale of an entire ecosystem.
Designing the Platform Driven Update Endpoint
There is a much better way to architect this entire ecosystem. The logic determining the lifecycle of a resource should remain exactly where it belongs, which is inside the platform itself.
The platform should expose this internal logic through a dedicated API endpoint. Let us visualize what this actually looks like in practice. Imagine you are building a platform that manages virtual machines. Alongside your standard endpoints for creating and deleting machines, you expose a specialized evaluation endpoint.
The sole purpose of this endpoint would be to accept a JSON payload containing the current state of a resource and the proposed desired state requested by the user. The platform would then analyze this comparison using its actual internal backend rules and return a definitive response. The response would strictly dictate whether the given change results in an in-place update or requires a full recreation.
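Concretely, the exchange might look like the following. The path and field names are assumptions for illustration, not a published standard. The request carries both states:

```json
{
  "current": {"name": "vm-1", "machine_type": "m5.large", "tags": {}},
  "desired": {"name": "vm-1", "machine_type": "m5.large",
              "tags": {"billing": "team-a"}}
}
```

And the platform answers with a definitive verdict computed from its own backend rules:

```json
{
  "changed_fields": ["tags"],
  "action": "update-in-place"
}
```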
This architectural shift would be completely transformative. It would allow any external integration vendor to rely on a single, absolute source of truth instead of duplicating the same fragile logic across multiple distinct providers. The platform backend remains the single authority on its own capabilities.
If the backend logic improves tomorrow, the external tools automatically inherit that improvement the very next time they query the evaluation endpoint. No pull requests to external repositories are needed. No provider version bumps are required. The ecosystem becomes deeply resilient by design.
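The whole loop can be sketched end to end with nothing but the standard library: a toy platform serving the evaluation endpoint, and an integration layer that asks it instead of consulting hardcoded rules. The endpoint path, payload shape, and updatable-field set are all assumptions for illustration:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The platform's internal rule set. Extending this set tomorrow
# improves every client immediately, with no provider release.
IN_PLACE_UPDATABLE = {"tags", "description"}

class EvaluateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        current, desired = payload["current"], payload["desired"]
        changed = {k for k in set(current) | set(desired)
                   if current.get(k) != desired.get(k)}
        action = "recreate" if changed - IN_PLACE_UPDATABLE else "update-in-place"
        body = json.dumps({"action": action}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), EvaluateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def plan_via_platform(current: dict, desired: dict) -> str:
    """The integration layer: ask the platform instead of guessing."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/evaluate",
        data=json.dumps({"current": current, "desired": desired}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["action"]

print(plan_via_platform({"tags": {}}, {"tags": {"env": "prod"}}))
```

Note that the client contains no lifecycle rules of its own; it cannot drift out of sync with the backend, because it never duplicated the backend's logic in the first place.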
The Impact on Artificial Intelligence and Next Generation Tooling
This architectural shift is not just about making life easier for developers writing traditional infrastructure tools. It is becoming an absolute necessity as we move rapidly into the era of AI agents and autonomous operational systems.
We are quickly approaching a future where AI agents will be responsible for provisioning, scaling, and repairing infrastructure automatically based on natural language prompts or system alerts. For these agents to function safely, they require pristine, highly accurate, and machine readable documentation.
Currently, poor specifications become a massive bottleneck for AI agents. If an artificial intelligence attempts to modify a complex environment, it needs absolute certainty about the blast radius of its automated actions. If the API documentation does not clearly dictate what causes a destructive recreation, the agent might accidentally delete a production database while attempting to simply modify a monitoring tag.
By centralizing the update versus recreate logic behind a dedicated, machine queryable endpoint, we provide AI agents with the exact deterministic feedback they need to operate safely within complex enterprise environments. Writing a high-quality OpenAPI specification is already essential, but pairing it with an authoritative lifecycle endpoint makes a platform truly ready for the next generation of autonomous operations.
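As a sketch, an agent guardrail built on such an endpoint could refuse to auto-apply anything the platform flags as destructive. The verdict shape here assumes the hypothetical evaluation response discussed above:

```python
def safe_to_auto_apply(verdict: dict, allow_recreate: bool = False) -> bool:
    """Gate automated changes on the platform's own lifecycle verdict.

    The agent never guesses the blast radius: in-place updates may
    proceed automatically, while recreates require explicit approval.
    """
    if verdict["action"] == "update-in-place":
        return True
    return allow_recreate

# Verdict as returned by the (assumed) evaluation endpoint
verdict = {"changed_fields": ["storage_engine"], "action": "recreate"}

if not safe_to_auto_apply(verdict):
    print("blocked: change requires recreate, escalating to a human")
```

The decision is deterministic because it comes from the same backend that will execute the change, not from a stale local copy of the rules.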
The MechCloud Approach: Pioneering Stateless Infrastructure
At MechCloud, we recognized this massive industry inefficiency and decided to take a completely different approach. We strongly believe that integrating a new platform should not require months of dedicated engineering work. We want to make it significantly easier both for users of our platform and for external platforms that want to seamlessly integrate with us.
We are proudly championing the concept of Stateless IaC. In our operational paradigm, the heavy lifting is completely removed from the external integration layer.
As a platform owner looking to integrate your services, you no longer need to worry about writing custom Go code or maintaining complex external state management systems. You only need to focus on two distinct, highly manageable tasks.
First, you must focus on writing a high-quality OpenAPI specification. As previously established, this is already a fundamental requirement for modern software development. A rich, accurate specification provides the essential foundation for validation, documentation, and automated interaction.
Second, you simply need to expose a dedicated endpoint that determines the update versus recreate behavior for brownfield resources. You expose your internal logic securely and efficiently.
With just these two elements properly in place, our system can completely onboard any modern platform or REST API in less than 30 minutes. We achieve full integration without requiring you to build, debug, or maintain a dedicated infrastructure provider.
Centralizing a Global Engineering Concern
The MechCloud approach effectively centralizes a global concern that has plagued the DevOps industry for years.
By adopting this model, you define your validation rules and your update behavior exactly once, in exactly one place. That place is your own codebase, where your engineers already have full control and visibility.
There is absolutely no need to update multiple downstream providers every single time your internal logic evolves. There is no need to wait for third party ecosystem maintainers to approve your pull requests. Furthermore, there is no need to forcefully propagate those changes across all of your enterprise customers and force them through painful upgrade cycles.
The current system of fragmented provider development is a massive drain on the software industry. It taxes startups, delays product shipping, introduces open source maintainer burnout, and creates unnecessary risks within production environments. By moving toward a standardized model of rich API specifications paired with authoritative lifecycle endpoints, we can completely eliminate the hidden costs of infrastructure management.
We can finally move back to focusing on what actually matters, which is building incredible products and delivering actual value to our users. The future of infrastructure is stateless, declarative, and radically simplified.
What Are Your Thoughts?
Have you ever experienced a deployment nightmare caused by a bug in an Infrastructure as Code provider? Have you ever had to maintain one of these complex integrations yourself? I would love to hear your horror stories and your thoughts on moving toward a Stateless IaC model. Drop your experiences in the comments below so we can discuss the future of Platform Engineering together.