DEV Community: Jim Counts

The Terraform Namer Pattern: Making Consistent Naming Easy at Scale

Jim Counts — Mon, 30 Jun 2025 02:14:45 +0000

Naming Is Infrastructure

If you work in the cloud, you've probably run into this: a resource with a name that doesn't quite follow the convention — or doesn't follow any convention at all.

At first, it seems harmless.

However, as environments expand, teams scale, and automation layers accumulate, inconsistent naming becomes a significant liability. CI/CD pipelines break. Logs become unreadable. Cross-environment lookups get fragile. And the next engineer wastes hours trying to guess what "rg-prod-east-xyz" is supposed to be.

In this post, I'll share a pattern I've used to solve this at scale.

The Creeping Pain Point

I was working on a project with a customer that had a mature IT department with well-defined naming conventions — not just for VMs and switches, but for every on-prem resource you could imagine. To their credit, they'd already updated those standards to cover cloud resources, too, even though, at the time, they didn't have anything in Azure yet.

I'll admit I didn't love the naming convention. It was a bit… ugly. But the customer's always right. As we set up their new Azure environment using Terraform, we did our best to follow their guidelines.

But then the mistakes started.

Sometimes, someone would forget the correct order of the tokens in a resource name. Other times, a token would be left out. Or worse, someone would invent their own "extension" to the standard, tossing in an extra token to suit a team-specific use case.

Most of these mistakes were unintentional. But they caused real pain.

You might think, "No big deal — just fix the name and redeploy."

Except we didn't always catch the problem early. In some cases, the resource was already in service, with data or downstream dependencies. Changing the name meant replacing the resource. Which, in practice, meant we were stuck with an incorrectly named, non-compliant resource. Forever.

The Problem: The `name` Property Is Too Flexible

Every Azure resource has a name property, and that property accepts a plain string. Any string. No rules. No structure. It's just a blob of characters — valid as long as Azure doesn't reject it. But Azure's naming rules are based on technical constraints, not your company's naming conventions.

When building our Terraform modules, we followed the same pattern as the encapsulated resources. We created an input variable called name, typed it as a string, and left it up to the individual developer calling the module to follow the documented naming convention.

Outside the module, we tried to help with local variables like resource_prefix or env_tag to build partial names more consistently. But at the end of the day, we were still pasting together fragments of strings. It was entirely up to each developer to get it right.

And inevitably, someone didn't.

Not because they didn't care — but because strings are easy to get wrong. Forget a token, change the order, add an extra piece "just this once," and suddenly, you've got a non-compliant name. Terraform doesn't care. Azure doesn't care. But your platform team does.

The result? We ended up with a mix of:

Partially named resources that didn't include environment or region
Overloaded names that stuffed in too much information
Resources that looked similar but didn't follow the real pattern

Even with good intentions, we couldn't enforce naming consistency — because the system provided no guardrails.

Naming That Just Works

Imagine every resource name is consistent. Compliant. Predictable.
You don't have to remember the order of tokens — or whether it's "prod-east" or "east-prod."
You don't even think about naming — because it's generated for you, automatically, and always correct.

Alerts and logs make sense, because the names they reference follow a known pattern.
Terraform can locate resources by convention — using data blocks and naming rules — instead of relying on remote outputs or hardcoded names.
You never have to choose between a painful resource migration and living with a non-compliant name.

And best of all? Developers can't easily get it wrong.

They don't pass in arbitrary strings anymore. Instead, they provide structured inputs — like environment, location, and service name — and let the naming logic handle the rest.

From Convention to Code

Before you can codify your naming convention, you need to have one.

As I mentioned earlier, many of my clients already had naming standards in place. But if you're starting from scratch, Microsoft's Cloud Adoption Framework is a great source of inspiration.

We adapted ideas from the CAF structure to match how the team actually thought about their infrastructure. For example:

rg-dev-centralus-svc-identity-0

Where:

Token	Meaning
`rg`	Resource type (`resource group`)
`dev`	Environment (`development`)
`centralus`	Azure region
`svc`	Workload grouping (`services`)
`identity`	Application or service name
`0`	Instance identifier (ordinal)

We ordered tokens from general to specific to support predictable sorting, filtering, and scanning.

But the important part isn't the order — it's consistency.

Ask yourself: What matters most when scanning names?
If it's resource type, put it first.
If it's app name, lead with that.
Pick an order that makes sense for your team and stick to it.

Once your structure is defined, the next step is to codify it.

We created a lightweight namer module with this interface:

variable "application" {
  default = null
  type    = string
}

variable "environment" {
  type = string
}

variable "instance" {
  default = null
  type    = number
}

variable "location" {
  type = string
}

variable "workload" {
  type = string
}

The implementation is simple but purposeful:

output "resource_suffix" {
  value = join("-", compact([
    var.environment,
    var.location,
    var.workload,
    var.application,
    var.instance
  ]))
}

Optional tokens (application, instance) are placed last.
Required tokens are always present.
compact() strips out null and empty-string values, so unused fields don't leave gaps.
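
To make the effect of the optional tokens concrete, here is a small sketch that can be pasted into a scratch configuration or evaluated piece by piece in terraform console. The local names are hypothetical; the expressions are the same join/compact combination with literal inputs.

```hcl
locals {
  # All five tokens supplied: every token appears, in the fixed order.
  full_suffix = join("-", compact(["dev", "centralus", "svc", "identity", "0"]))
  # => "dev-centralus-svc-identity-0"

  # Optional tokens left as null: compact() drops them before join() runs,
  # so there is no trailing or doubled hyphen.
  minimal_suffix = join("-", compact(["dev", "centralus", "svc", null, null]))
  # => "dev-centralus-svc"
}
```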

Here's how we typically use the namer module inside a resource module — like one that provisions a resource group:

module "namer" {
  source      = "../namer"
  environment = var.environment
  location    = var.location
  workload    = var.workload
  application = var.application
  instance    = var.instance
}

resource "azurerm_resource_group" "this" {
  name     = "rg-${module.namer.resource_suffix}"
  location = var.location
}

And in the higher-level calling module:

module "identity_rg" {
  source      = "../modules/resource-group"
  environment = "dev"
  location    = "centralus"
  workload    = "svc"
  application = "identity"
  instance    = 0
}

The caller doesn't have to build the name manually or remember the token order — they just pass structured values, and the module takes care of the rest.

💡 Notice in the resource module that the namer only supplies the resource suffix, not the full name. The resource module itself provides the prefix (rg). This separation of concerns keeps the namer module reusable — it can be embedded in any resource module.
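
The same suffix also supports the lookup-by-convention idea mentioned earlier. The following is a minimal sketch, assuming a separate configuration that needs to find the identity resource group without reading remote outputs. The module path and the module label identity_name are assumptions for illustration.

```hcl
# Hypothetical consumer configuration. It rebuilds the name from the same
# structured inputs instead of hardcoding it or reading another state file.
module "identity_name" {
  source      = "../modules/namer" # assumed path to the namer module
  environment = "dev"
  location    = "centralus"
  workload    = "svc"
  application = "identity"
  instance    = 0
}

# Look up the existing resource group purely by its conventional name.
data "azurerm_resource_group" "identity" {
  name = "rg-${module.identity_name.resource_suffix}"
}

output "identity_rg_id" {
  value = data.azurerm_resource_group.identity.id
}
```

One trade-off in this sketch is that the "rg" prefix now lives in two places. If that becomes a concern, the resource module could expose the final name as an output.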

Even Microsoft Built a Namer

We're not the only ones to notice the need for codified naming.

Around the same time I wrote my first namer, Microsoft released an official Terraform module for naming Azure resources. Their module constructs names using inputs such as prefix and suffix. It's flexible by design, which makes it broadly applicable across thousands of organizations.

And while we share the same goal (consistency), our approaches reflect different audiences:

Microsoft has to serve everyone. I just need to serve my clients — and get it right for them.

My namer module is opinionated by design. It expects structured inputs such as environment, location, workload, application, and instance. It handles optional tokens predictably, and the output is consistent.

This approach allows me to codify domain-specific structures. For example, one of my customers organizes infrastructure by program, grouped into solutions, each with multiple applications. That's easy to reflect in a structured namer. For Microsoft, building a module that covers all such variations would be nearly impossible.

So while both modules solve the naming problem, they serve different needs.

Answering Common Objections

Developers Can Still Pass Garbage Into the Namer

Absolutely — and that's a valid concern.

Just because we've wrapped naming in a module doesn't mean the problem goes away. Developers can still pass invalid strings into location, environment, workload, or any of the other tokens. It's entirely possible to write:

location    = "CentralUs"
environment = "Development"

…and end up with a name that breaks consistency or violates Azure constraints.

The problem isn't the module — it's unvalidated input.

Terraform gives us tools to fix this, using validation blocks on input variables:

variable "location" {
  type = string
  validation {
    condition     = contains(["centralus", "eastus2", "westeurope"], var.location)
    error_message = "Location must be one of: centralus, eastus2, westeurope."
  }
}

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "test", "prod"], var.environment)
    error_message = "Environment must be one of: dev, test, prod."
  }
}

These constraints eliminate "almost right" values like Production, east-us, or qa1 — small inconsistencies that erode standardization over time.

And this is another reason the module matters: it centralizes validation logic.

Even if a downstream resource module forgets to validate environment or location, the namer module ensures only known-good values are accepted. That makes it easier to scale across teams and repositories without trusting everyone to remember every rule, every time.
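
Allow-lists work well for closed sets like environment and location. For free-form tokens such as application, a pattern check is one option. This is a sketch only: the regular expression and the 12-character limit are assumptions, not part of the convention described above.

```hcl
variable "application" {
  type    = string
  default = null

  validation {
    # Accept null (token omitted) or a short lowercase alphanumeric token.
    # can() turns a failed regex match into false instead of an error.
    condition     = var.application == null || can(regex("^[a-z0-9]{1,12}$", var.application))
    error_message = "Application must be 1-12 lowercase letters or digits, or omitted."
  }
}
```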

HashiCorp Says to Avoid Nested Modules

HashiCorp's guidance recommends being cautious with module composition. Specifically, they warn that deeply nested modules can make Terraform harder to reuse, test, and understand. And they're right — in general.

But let's unpack what that guidance actually means.

⚠️ The problem isn't nesting itself — it's unstructured, deep, or unnecessary nesting.

In our case, we're embedding a small, single-purpose utility module — the namer — inside a resource-specific module (like one that provisions a resource group or app service). That's not deep or complex. It's deliberate encapsulation.

Here's why this pattern works well in practice:

No added control flow — The namer has no dependencies, branching, or side effects. It just returns a string.
Improved DRY and correctness — Without it, every resource module would need to duplicate the naming logic — and probably do it inconsistently.
Testable in isolation — The namer can be unit-tested separately or used directly outside nested contexts.
Simpler for consumers — Callers only provide structured context. They don't need to understand or maintain the naming format.
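
As one way to exercise the "testable in isolation" point, the sketch below uses Terraform's built-in test framework (terraform test, available from Terraform 1.6). The file name is hypothetical, and the last run assumes the namer declares the location validation shown earlier.

```hcl
# tests/namer.tftest.hcl (hypothetical file inside the namer module)

run "builds_full_suffix" {
  command = plan

  variables {
    environment = "dev"
    location    = "centralus"
    workload    = "svc"
    application = "identity"
    instance    = 0
  }

  assert {
    condition     = output.resource_suffix == "dev-centralus-svc-identity-0"
    error_message = "Suffix should list all five tokens in the documented order."
  }
}

run "omits_optional_tokens" {
  command = plan

  variables {
    environment = "dev"
    location    = "centralus"
    workload    = "svc"
  }

  assert {
    condition     = output.resource_suffix == "dev-centralus-svc"
    error_message = "Unset optional tokens should not leave gaps."
  }
}

run "rejects_unknown_location" {
  command = plan

  variables {
    environment = "dev"
    location    = "CentralUs" # wrong casing, should fail validation
    workload    = "svc"
  }

  # The run passes only if the location variable's validation fails.
  expect_failures = [var.location]
}
```

Because the namer creates no resources, these runs need no cloud credentials.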

At scale, where consistency is crucial and modules are reused across teams and environments, this lightweight composition pattern has paid off again and again.

A Lot of Work Just to Build a String

At first glance, the namer module might look like overkill. It just produces a formatted string, right?

But in practice, we've extended the namer to cover a variety of real-world scenarios — especially the inconsistent naming requirements across Azure services.

Different Azure resources have different naming rules:

Some require lowercase alphanumeric only
Some disallow hyphens
Some have character limits as low as 24
Some allow longer, more expressive names

Besides the full resource suffix, here's what the module provides:

output "resource_suffix" {
  description = "A standardized resource suffix combining environment, location, workload, application, and instance identifiers. Use this with a resource type prefix to create consistent resource names across your infrastructure."
  value       = local.resource_suffix
}

output "resource_suffix_compact" {
  description = "A compact version of the resource suffix with all hyphens removed. Useful for resources with strict length limitations or naming conventions that don't allow hyphens."
  value       = replace(local.resource_suffix, "-", "")
}

output "resource_suffix_short" {
  description = "A shortened resource suffix using abbreviated environment and location codes. Designed for resources with restrictive naming length requirements while maintaining readability."
  value       = local.resource_suffix_short
}

output "resource_suffix_short_compact" {
  description = "A shortened and compact resource suffix with abbreviated codes and no hyphens. Ideal for resources with stringent length limitations."
  value       = replace(local.resource_suffix_short, "-", "")
}
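
The outputs above reference two locals that are not shown. A minimal sketch of how they might be built follows. The abbreviation tables are hypothetical, and the actual module may derive its short codes differently.

```hcl
locals {
  # Hypothetical abbreviation tables; real codes come from your convention.
  environment_codes = {
    dev  = "d"
    test = "t"
    prod = "p"
  }
  location_codes = {
    centralus  = "cus"
    eastus2    = "eus2"
    westeurope = "weu"
  }

  # Full suffix: same expression as the original single-output version.
  resource_suffix = join("-", compact([
    var.environment,
    var.location,
    var.workload,
    var.application,
    var.instance
  ]))

  # Short suffix: swap environment and location for their abbreviations.
  resource_suffix_short = join("-", compact([
    local.environment_codes[var.environment],
    local.location_codes[var.location],
    var.workload,
    var.application,
    var.instance
  ]))
}
```

With the inputs from the earlier identity example and these assumed codes, the short compact form would be dcussvcidentity0. A storage account module could prefix that with "st" and stay within the 24-character, lowercase-alphanumeric limit.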

By centralizing the logic, we:

Reduce duplication — One implementation, many consumers
Enforce consistency — Developers can't "almost follow" the pattern
Support constraints — Compact and short formats are pre-baked

And as a bonus?

We also generate standardized tags:

output "tags" {
  description = "Standardized tags including application, creation date, DevOps team, environment, owner, repository, source, and workspace information. These tags follow organizational tagging standards for resource management and cost allocation."
  value       = local.tags
}

With just a few more input variables, the same namer module can output tagging dictionaries that:

Drive cost management and showback
Enforce platform tagging policy
Improve search and grouping in the Azure Portal
Make incident response and ownership tracking easier
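
A sketch of how local.tags might be assembled is shown below. The owner and repository variables and the tag keys are assumptions for illustration; the real module emits a longer set, as its output description lists.

```hcl
# Extra inputs assumed for this sketch; they are not part of the
# namer interface shown earlier.
variable "owner" {
  type = string
}

variable "repository" {
  type = string
}

locals {
  # Build the full map first, then drop entries whose value is null so
  # that optional tokens never produce empty tags.
  tags = {
    for key, value in {
      application = var.application
      environment = var.environment
      owner       = var.owner
      repository  = var.repository
      workload    = var.workload
    } : key => value if value != null
  }
}
```

A resource module can then set tags = module.namer.tags next to the generated name.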

💡 The namer isn't about abstraction for abstraction's sake.
It's about operational predictability and platform integrity.

Make the Right Thing the Easy Thing

Naming might seem like a minor detail — until it's not. When naming breaks down, platforms become harder to navigate, automation becomes brittle, and developers waste time chasing avoidable errors.

The Terraform namer pattern isn't magic. It's a small, opinionated module that codifies your naming strategy, gives teams a consistent interface, and reduces the surface area for human error.

By investing a little upfront effort to centralize and automate naming (and tagging), you gain:

Predictable infrastructure that's easier to support
A shared language for your team and your tools
Guardrails that catch problems before they land in production
A stronger foundation for growth and reuse

If you're tired of fixing naming issues after the fact — or if you're scaling a Terraform-based platform across teams — this pattern can save you headaches later.

Need Help Putting This Pattern to Work?

I've seen this pattern save teams from endless frustration — and I've helped organizations of all sizes implement it across their platforms.

If you're wrestling with naming drift, Terraform sprawl, or platform inconsistencies, let's talk.

👉 Connect with me on LinkedIn — I'd love to hear what you're building and see how I can help.

This post was originally published on jamesrcounts.com. If you found this helpful, consider sharing it with your team or following me for more infrastructure and DevOps insights.

Why Your Terraform Platform Isn't Scaling—and What to Do About It

Jim Counts — Mon, 23 Jun 2025 01:29:55 +0000

Most Terraform blog posts start at the middle layer—deploying infrastructure like networks, services, or security policies. But that assumes something important: that your Terraform platform is already in place.

Before you deploy a single subnet or virtual machine, you need to establish the foundation that makes Terraform work at scale. That foundation is the root layer—and getting it right means the difference between a fragile pile of scripts and a scalable, governed infrastructure platform.

In this post, I'll share how I structure the root layer to support multi-environment, multi-team Terraform setups using Terraform Cloud and GitHub (or Azure DevOps). This isn't theory—it's what I've learned after multiple iterations across real-world orgs.

Production Was Perfect. Everything Else Still Ran on Tickets.

In many cloud environments, infrastructure as code has revolutionized how we deploy applications. Terraform, pipelines, and Git workflows let us spin up production-ready systems with confidence and speed.
But there's a catch: the automation itself often runs on an unautomated foundation.

While application environments are managed as code, the back office—the systems that support your infrastructure—remains a patchwork of manual processes, ticket queues, and tribal knowledge. Think service principals, repo permissions, pipeline bootstrapping, secrets rotation.

This is a problem I first ran into at a financial services company during one of my earliest large-scale Terraform automation projects. On the surface, we had it figured out. Our Terraform setup was clean. New Azure resources—VMs, subnets, storage—could be provisioned by anyone on the team, no tickets, no waiting. Just a PR, a plan, and a merge. It felt like DevOps was finally working.

But that illusion cracked the moment we needed to touch the platform behind the automation.

If I needed a new service principal in Entra ID, I had to open a ServiceNow ticket.
If I needed access to a Git repo or a shared pipeline library in Azure DevOps, I needed to hunt down the Project Collection Administrator.
If we wanted a new workspace in Terraform Cloud, forget it—we were back to tribal knowledge and manual steps.

The production environment was a modern, automated marvel.
The platform that powered it? A legacy ops bottleneck with no change control and no repeatability.

It was frustrating, but more than that—it was dissonant.

I could build secure, repeatable landing zones with Terraform, but I couldn't automate the identity, pipelines, or secrets that made those zones possible in the first place.

That was the real pain: living in two different worlds. One where DevOps worked. One where it didn't.

Eventually, I realized: the automation platform is part of the platform. And if it's not managed like code, the rest of your automation is standing on sand.

Automating the Automation Platform

Imagine spinning up a brand-new cloud environment—a subscription or resource group and a workspace—complete with its own identity, secrets, and pipelines, all wired into your CI/CD platform. No tickets. No Byzantine approvals. No tribal knowledge.

In this model, you're not just deploying services with Terraform. You're defining the environment in which those services will live.

A scoped Entra ID service principal with the right roles? Code.
A private Git repo with permissions set and a pipeline ready to go? Code.
A secure secret store wired into your deployment workflow? Code.
A Terraform Cloud workspace with tagging, policies, and access controls? All in code.

The environment scaffolding itself becomes reproducible: not just VMs and networks, but the platform plumbing—identity, access, and automation.

Even the back-office systems—Terraform Cloud, Azure DevOps, Entra ID—are treated as first-class infrastructure, managed and governed through code just like the application stack.

It's faster. Safer. Repeatable. And crucially, it scales without central bottlenecks.

This is the kind of foundation I started calling the root layer: a baseline of automation that treats the environment as infrastructure, and manages the platform itself as code.

Scalable Platform Automation, by Design

The root layer isn't a single Terraform module — it's a layered architecture that automates the scaffolding of your platform and its delivery environments. Each layer plays a different role, with a different cadence of change:

🧱 Root Workspace

The foundation. This is applied once (or very rarely) and manages your Terraform Cloud organization at the highest level. It establishes global constructs, such as teams, policies, and projects. Most importantly, it enables the automation of Terraform Cloud itself, allowing workspaces, modules, and environments to be managed as code.
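
As a rough illustration of what the root workspace's configuration can contain, here is a minimal sketch using the hashicorp/tfe provider. The organization name, email address, project, and team are placeholders, and team management may depend on your Terraform Cloud plan.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Placeholder names and email; substitute your own.
resource "tfe_organization" "this" {
  name  = "example-org"
  email = "platform@example.com"
}

# A project to group the platform's own workspaces.
resource "tfe_project" "platform" {
  organization = tfe_organization.this.name
  name         = "platform"
}

# A team that will later be granted access to platform workspaces.
resource "tfe_team" "platform_engineers" {
  organization = tfe_organization.this.name
  name         = "platform-engineers"
}
```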

🧭 Workspaces Workspace

This layer runs each time a new environment or project-level landing zone is needed. It creates:

Workspaces (including its own and the root workspace)
Azure credentials with the proper scope and RBAC roles
Variable sets and their associations with workspaces
Optionally, Git repositories (when a new repo is needed)

This workspace is revisited as projects grow, new zones are required, or shared pipelines need to be extended. It's the engine behind scaling your platform one secure, self-contained environment at a time.
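
The Terraform Cloud side of this layer might look like the sketch below, again using the hashicorp/tfe provider. The landing zone name, repository identifier, and OAuth client name are placeholders; a real implementation would likely iterate over a map of landing zones.

```hcl
variable "organization" {
  type = string
}

locals {
  # Hypothetical landing zone name.
  landing_zone = "identity-dev"
}

# Look up the VCS connection configured during bootstrap.
data "tfe_oauth_client" "vcs" {
  organization = var.organization
  name         = "governance-vcs" # assumed display name of the OAuth client
}

resource "tfe_workspace" "landing_zone" {
  organization = var.organization
  name         = local.landing_zone

  vcs_repo {
    identifier     = "example-org/${local.landing_zone}" # placeholder repo
    oauth_token_id = data.tfe_oauth_client.vcs.oauth_token_id
  }
}

# Variable set that will hold this landing zone's Azure credentials.
resource "tfe_variable_set" "azure" {
  organization = var.organization
  name         = "${local.landing_zone}-azure-credentials"
}

# Attach the variable set to the workspace.
resource "tfe_workspace_variable_set" "azure" {
  workspace_id    = tfe_workspace.landing_zone.id
  variable_set_id = tfe_variable_set.azure.id
}
```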

🧩 Shared Modules Workspace

This workspace provides reusable infrastructure building blocks, such as an App Service module or a shared Virtual Network (VNet) template. Each shared module gets:

Its own Git repository (always 1:1)
A Terraform Cloud registry entry
Automated registration and webhook setup so updates flow directly from Git

This workspace runs whenever a new shared capability is developed, helping teams reuse best-practice infrastructure without duplicating code.

Together, these workspaces provide a platform that can be deployed securely, consistently, and without ticket-driven friction. You don't just automate environments. You automate the ability to create and evolve environments as your organization grows.

Getting Started: Bootstrapping the Root Workspace

If you're ready to adopt this layered platform model, the first step is to bootstrap the root workspace—the foundation that allows Terraform Cloud to be managed as code.

Terraform Cloud can't manage itself until something external creates the organization and initial workspace, so we begin with a short-lived, locally executed Terraform configuration.

Bootstrapping Steps

1. Author the organization configuration in a local Terraform workspace on your laptop.
2. Apply that configuration locally to create the Terraform Cloud organization.
3. Configure the VCS connection:
   - For GitHub, this can be done via Terraform.
   - For Azure DevOps, you must manually create the OAuth client in the UI. This is a required ClickOps step.
4. In a separate module/folder, author the code for managing workspaces. This code will:
   - Create the root workspace in Terraform Cloud.
   - Create the governance Git repository.
   - Reference the OAuth client as a data source (whether it was created manually or through automation).
5. Push the code to the governance repository in your Git provider.
6. Update the backend block in your Terraform config to use the cloud backend tied to the root workspace.
7. Reinitialize the root workspace, uploading the local state to Terraform Cloud.
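
For the last two steps, current Terraform versions typically express the Terraform Cloud connection with a cloud block. The following is a minimal sketch; the organization and workspace names are placeholders.

```hcl
terraform {
  cloud {
    organization = "example-org" # placeholder organization name

    workspaces {
      name = "root" # assumed name of the root workspace
    }
  }
}
```

After adding the block, running terraform init again detects the change and offers to migrate the existing local state to the named workspace.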

At this point, your root workspace is fully operational, managing the Terraform Cloud organization itself, backed by version-controlled code.

But that's just the beginning. Next, you'll bootstrap the workspaces workspace, which provisions delivery environments and wires them into your platform.

Bootstrapping the Workspaces Workspace

Once the root workspace is online, the next step is to bootstrap the workspaces workspace—the Terraform workspace responsible for provisioning landing zones and wiring up delivery environments.

The process mirrors the root workspace bootstrap, but with more complexity. While the root workspace only interacts with Terraform Cloud, the workspaces workspace must also communicate with Azure, Entra ID, and your version control system. That means all required credentials must be in place before Terraform can even generate a plan.

Required Credentials

To bootstrap the workspaces workspace, you'll need:

Terraform Cloud token (as used by the root workspace)
VCS token (GitHub or Azure DevOps)
A privileged Azure service principal, with:
- Azure permissions:
  - Reader on the subscription (so Terraform can inspect it)
  - User Access Administrator (to assign roles to child service principals)
- Entra ID permissions:
  - Cloud Application Administrator (to create service principals)
  - Privileged Role Administrator (only if you plan to assign Entra ID roles to child principals)

This service principal is essential. The workspaces workspace uses it to create additional service principals for each landing zone, so it must have broad authority across Azure and Entra.
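
To show why those permissions are needed, here is a sketch of the kind of child principal the workspaces workspace might create for one landing zone. The display name and the Contributor role are assumptions, and the client_id attribute assumes a recent azuread provider (older releases used application_id).

```hcl
data "azurerm_subscription" "current" {}

# Creating the application and service principal is what requires the
# Cloud Application Administrator role in Entra ID.
resource "azuread_application" "landing_zone" {
  display_name = "sp-dev-centralus-svc-identity" # placeholder name
}

resource "azuread_service_principal" "landing_zone" {
  client_id = azuread_application.landing_zone.client_id
}

# Assigning an Azure role is what requires User Access Administrator
# on the target scope. The role chosen here is only an example.
resource "azurerm_role_assignment" "landing_zone_contributor" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.landing_zone.object_id
}
```

In practice the scope would usually be narrower than the whole subscription, such as a single resource group.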

Note: During bootstrap, Terraform authenticates as the user running the code. This user must be highly privileged in the tenant. Use az login beforehand to provide the required Azure and Entra tokens locally.

Bootstrapping Steps

1. Author the workspaces workspace configuration locally.
   - Ensure all external credentials are passed in via workspace variables (Terraform Cloud, GitHub/Azure DevOps).
   - Configure this workspace to create its own Azure service principal as described above.
   - Create variable sets in Terraform Cloud containing the Azure and VCS credentials. (You should already have a set for Terraform Cloud from the root workspace setup.)
2. Run the workspace locally, just as you did with the root workspace.
   - This creates the workspaces workspace in Terraform Cloud.
   - You may reuse the same Git repository (I typically organize root-layer workspaces into separate folders in one repo).
3. Push the code to Git.
4. Update the backend block to point to Terraform Cloud.
5. Reinitialize the workspace, pushing its state to Terraform Cloud.

Once bootstrapped, the workspaces workspace can fully automate environment creation: spinning up project-specific service principals, assigning roles, creating Git repos, configuring pipelines, and wiring everything into Terraform Cloud workspaces.

Creating the Shared Modules Workspace

The final component of the root layer is the shared modules workspace. Unlike the root and workspaces workspaces, this one doesn't require special bootstrapping—it can be defined and provisioned directly by the workspaces workspace, just like any other environment-specific workspace.

Because its role is to publish reusable infrastructure modules, it only needs credentials for two systems:

Terraform Cloud — already handled by the root workspace
Version Control System (GitHub or Azure DevOps) — already configured for the workspaces workspace

Provisioning Steps

Once the necessary credentials are in place, you can define the shared modules workspace inside the workspaces workspace codebase:

1. Add a new workspace definition for the shared modules workspace in the workspaces workspace.
2. Apply the workspaces workspace to create the new workspace in Terraform Cloud and associate it with a Git repository.
3. Author module management code in that Git repo to:
   - Create a new repository for each shared module.
   - Register each module in the Terraform Cloud private registry.
   - Set up webhooks so the registry tracks changes to the module codebase.
4. Push the module management code to Git and let Terraform Cloud do the rest.
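
The module management code could be sketched as follows, assuming GitHub as the VCS and the integrations/github and hashicorp/tfe providers. The module list and variable names are hypothetical. Registering a VCS-backed module is also what sets up the webhook on the repository.

```hcl
variable "organization" {
  type = string
}

variable "oauth_token_id" {
  type = string
}

locals {
  # Hypothetical shared modules. The private registry expects repository
  # names of the form terraform-<PROVIDER>-<NAME>.
  shared_modules = toset(["app-service", "virtual-network"])
}

# One repository per shared module (always 1:1).
resource "github_repository" "module" {
  for_each   = local.shared_modules
  name       = "terraform-azurerm-${each.key}"
  visibility = "private"
  auto_init  = true
}

# Register each repository in the Terraform Cloud private registry.
resource "tfe_registry_module" "module" {
  for_each     = local.shared_modules
  organization = var.organization

  vcs_repo {
    display_identifier = github_repository.module[each.key].full_name
    identifier         = github_repository.module[each.key].full_name
    oauth_token_id     = var.oauth_token_id
  }
}
```

Adding a module then becomes a one-line change to the list. Note that tag-based publishing generally needs at least one semantic version tag in the repository before a module version appears.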

This workspace now serves as your publishing engine for infrastructure building blocks—like App Service templates, shared VNets, or other reusable constructs—ensuring they're delivered and tracked with the same rigor as any other environment.

At this point, your root layer is complete:

The root workspace manages Terraform Cloud itself.
The workspaces workspace provisions environments and organizational scaffolding.
The shared modules workspace delivers reusable infrastructure components.

With this structure in place, you're ready to use the workspaces workspace to provision real, production-ready landing zones—securely, consistently, and with zero ticket friction.

Proven in the Field

This approach isn't just theory. I've implemented variations of this root layer architecture across demos, internal tools, and production environments for real-world organizations—including teams in financial services, non-profits, the health sector, energy, and startups.

In each case, separating concerns across the root, workspaces, and shared modules workspaces gave teams the confidence to move faster—with less friction, stronger governance, and fewer surprise dependencies.

I've watched this pattern scale from small pilots to enterprise-wide platforms. It enables autonomy without chaos, governance without gridlock.

And most importantly, it helps teams ship infrastructure like software, without getting stuck in ticket queues.

What You Might Be Thinking

Why not just use one workspace?

You might wonder why we use three separate workspaces instead of a single, monolithic Terraform configuration to manage everything. The answer comes down to scope, security, and lifecycle.

Each workspace in the root layer has a different purpose, cadence, and trust boundary:

The root workspace manages your Terraform Cloud organization itself. It changes rarely and requires elevated permissions, but only for Terraform Cloud.
The workspaces workspace is your platform automation engine. It provisions project-level environments and needs broad access to Azure, Entra ID, and your VCS. It changes more frequently as new teams and environments are added.
The shared modules workspace manages reusable building blocks. It operates independently from environment provisioning and evolves on its own timeline as new modules are developed.

Keeping these concerns separate makes it easier to:

Apply the principle of least privilege to each workspace
Delegate ownership without compromising the whole platform
Test and evolve parts of your automation independently

It also improves performance and manageability over time. As your environment grows, so does the Terraform state. Separating workspaces means smaller, more focused state files—which translates into faster refresh times and easier troubleshooting as the platform scales.

A monolithic configuration might work for a single team or proof of concept, but it doesn't hold up in a real-world platform engineering scenario. Separation is what keeps the root layer resilient and scalable.

Isn't this overkill for a small team?

Yes—if you're just experimenting with Terraform or spinning up a dev sandbox, this model might feel heavy. But it isn't meant for one-off environments. This is for teams building a reusable, secure platform that others will consume.

Even smaller orgs benefit from separating the concerns of identity, pipelines, secrets, and infrastructure. You don't need a massive team to justify this setup. You just need a real requirement for repeatability.

I've seen this approach succeed with three-person SRE teams and ten-person app teams. Team size isn't the deciding factor: what matters is how many environments you expect to create, and how often you want to do it without friction.

Isn't it risky to let Terraform manage service principals?

It's natural to be cautious about automating privileged operations. But the alternative is worse: having those actions done manually, out of band, with inconsistent controls and zero audit trail.

The access still needs to exist—whether provisioned by Terraform or by a sysadmin clicking around the portal. With Terraform, you get a repeatable, reviewable process and a complete change history. You can also automate revocation and cleanup when environments are decommissioned.

If security and compliance are concerns (and they should be), infrastructure as code gives you the best shot at managing them responsibly.

Isn't this tied too tightly to Terraform Cloud?

It's true that I use Terraform Cloud for most implementations—it's easy to get started and removes a lot of the heavy lifting.

But this architecture doesn't require it. I've implemented similar root-layer setups using Azure DevOps as the CI/CD backbone, with pipelines responsible for managing Terraform backends and executing plans.

It's more effort to set up the equivalent of TFC's remote execution model yourself, but it can be done. The patterns still apply. In fact, they're even more important when you're building the plumbing by hand.

Doesn't this just recreate the same internal platform that already frustrated us?

It might look that way on the surface—but this time, it's different.

This root layer isn't hidden behind tickets or maintained by a shadow platform team. It's written in code. It lives in Git. It evolves with your organization.

And most importantly: you can fork it. If a team needs to move faster, or create a slightly different delivery model, they're not stuck. They're empowered.

This isn't about central control—it's about shared autonomy, delivered through code.

Ready When You Are

If you've made it this far, you're probably already thinking about how to bring something like this to your own organization. That's great. Start with the root workspace. Take your time. Keep the pieces small and focused.

And if you want to compare notes—or would like help getting your root layer off the ground—I'd be happy to connect. You can find me on LinkedIn or reach out through my site.

Infrastructure gets better when we treat it like software. Platforms do too.

This post was originally published on jamesrcounts.com.

How I Use ChatGPT to Interview Myself (and Why You Should Too)

Jim Counts — Thu, 05 Jun 2025 14:41:45 +0000

Most people treat ChatGPT like a search engine. I treat it like a collaborator. What started as a tool to summarize chat logs evolved into something more powerful: a thinking partner that interviews me. It's how I beat blank-page syndrome, uncover ideas I hadn't consciously considered, and stay focused when the stakes feel high.

It Started With Blank Pages

I often struggle with first drafts. Whether it's documentation, blog posts, or even a LinkedIn recommendation for someone I deeply admire, that first sentence is always the hardest. Revising is easy. Starting is not.

Take the example of writing a recommendation for a colleague I greatly respect. I immediately thought of three or four qualities that make working with her wonderful. But for weeks, I couldn't get a single sentence down. The emotional stakes made the blank page even more intimidating.

Finally, I asked ChatGPT:

"I want to write a LinkedIn recommendation for a colleague — interview me about this colleague and help me write it."

That simple shift unlocked everything. ChatGPT started asking me thoughtful, direct questions. Within 30 minutes, I had a meaningful, polished recommendation ready to share. The process turned emotional friction into flow.

                        [ INTERVIEW EXCERPT ]
                        --------------------

ME:
  "I want to write a LinkedIn recommendation for a colleague.
   Interview me about her."

CHATGPT:
  "How do you know her?"
  "What projects did you work on together?"
  "What stood out about her skills or communication?"

                        --------------------

From there, I just answered questions, one at a time. No pressure to write. Just reflect and respond.

Why it worked: This wasn't just about writing faster. The questions helped me reconnect with what I actually valued about working with this person. It changed the tone of the recommendation from generic praise to something honest and specific.

Why Interviewing Works Better Than Drafting

The magic of this technique is in the back-and-forth. When I prompt ChatGPT with something like, "What else would you like to ask?", it digs deeper and often surprises me with angles I hadn't considered. This creates momentum and structure — a welcome contrast to meandering chat threads or scattered brainstorming sessions.

The result?

Deeper insights
A natural conversation flow
A clear ending (when GPT finally says "I have no more questions")

Unlike traditional drafting, this makes it easy to refer back to the full conversation later. It's self-contained, intentional, and easy to mine for content.

From Prompt to Blog Post: A Real Example

Before I ask ChatGPT to interview me, I try to give it as much context as possible. If I'm replying to an email, I'll paste in the message thread. If I'm working from a presentation, I'll include the slide deck and any abstracts or speaker notes.

Recently, I wanted to turn a conference talk into a blog post. I had the abstract and my PowerPoint deck from the event, but turning that into a coherent, engaging article felt like a bigger task than it should have. So I dropped everything I had—the event description and the deck—into ChatGPT and said:

"I would like to develop this into a blog post."

ChatGPT immediately returned a rough outline. It wasn't a finished piece, but it captured the general structure of the talk—just without much soul. As I reviewed it, I realized something important was missing: a personal story I always tell live. The story isn't in the slides or the abstract, and I'd never written it down—but it's one of the most memorable parts for the audience.

So I prompted ChatGPT again:

"I want to include a personal story about Knott's Berry Farm. It should be two paragraphs long. Interview me about the story so we can include it."

That kicked off another round of questions:

"When did you go to Knott's Berry Farm, and who were you with?"
"What happened during the visit that connects to your HA/DR experience?"
"What emotions do you associate with that moment?"
"If you had proper HA/DR in place back then, how would the day have gone differently?"
"What message do you want readers to take away from this story?"

It was exactly the kind of scaffolding I needed. Within minutes, I had a focused, emotionally resonant anecdote—one that deepened the content and made the message more memorable, just like it does when I give the talk live.

The Knott's Berry Farm story wasn't the only part of the blog post that benefited from this technique. As the post evolved from outline to draft, I used the same interview approach to improve weaker sections. If a paragraph felt shallow or underdeveloped, I would highlight it and ask ChatGPT to interview me about just that idea. It helped me surface better examples, clarify my thinking, and add the kind of detail that turns rough ideas into something useful and complete.

👉 You can read the full story and the final blog post it became part of here:
HA/DR for Developers: Building Resilient Systems Without Losing Sleep

I don't always use fancy prompts. Once the interview is underway, I'll often drop in quick questions to keep it going:

"What else would you ask me?"
"What am I missing?"
"Pretend you're reviewing this as a stakeholder—what concerns would you raise?"

These are signals to ChatGPT that I'm not done yet—that I still have more to say, and I want it to keep digging. The goal isn't just to move on to a draft, but to make sure we've really explored the topic first. When the questions slow down, that's when I know we're ready to synthesize.

The key is giving ChatGPT enough signal to ask meaningful questions—then treating those questions as scaffolding for whatever I'm trying to write.

Tips for Getting the Most From This Technique

Be clear up front: Give GPT context—what you're writing, who it's for, and what format you need.
Let the questions guide you: Don't rush to generate output. Let the questions shape your thinking.
Signal the interview mode: Phrases like "What else would you like to ask?" keep the interaction focused.
Let it end: When GPT runs out of questions, that's your cue to synthesize and generate.

A Tool for Thinking, Planning, and Communicating

This technique has helped me:

Prepare conference talks
Draft technical documentation
Plan architectural decisions
Deliver emotionally meaningful writing under pressure

It improves my thinking, not just my writing.

Final Thought: Why You Should Try It

Using ChatGPT as an interviewer isn't about outsourcing your ideas. It's about structuring your own thought process so that new insights can emerge—especially when it's hard to start, or the stakes feel high.

It turns blank pages into conversations. Questions into momentum. And ideas into something you're proud to hit "send" on.

Let's Talk About Your Next Big Idea

📬 Want to Work Smarter with AI?

If you're curious about using ChatGPT as a thinking partner, or want an actual human to help you get past your next blank page, I'd love to connect.

Let's chat on LinkedIn »

Originally published on jamesrcounts.com

HA/DR for Developers: Building Resilient Systems Without Losing Sleep

Jim Counts — Mon, 02 Jun 2025 00:04:11 +0000

TL;DR: Your system will fail. But that doesn't mean your weekend plans have to.

The Day I Missed Knott's Berry Farm

A few years ago, I had planned a family trip to Knott's Berry Farm with my wife and daughter. It wasn't about the destination—it was about finally taking a day off together after weeks of coordinating calendars. But at the time, I was leading platform engineering for a financial services client that was going through a rough stretch: seven production outages in ten days. None were caused by the platform, but every one of them required platform involvement to troubleshoot.

The morning of the trip, nothing happened. No outage. No red alert. But I was so rattled by the prior ten days, so worn down by the sense of looming failure, that I told my wife I couldn't risk being away. I stayed home. They went without me. And nothing happened. That kind of fear-based decision is exactly what good HA/DR should help prevent. When your systems are designed to tolerate failure, you can tolerate being offline for a day—and maybe even enjoy the ride.

DevOps Without the 2AM Alert

DevOps culture encourages ownership—but too often, that ownership comes at the cost of personal time. You promise your family a day at the amusement park, only to stay home "just in case." You wrap up work, but can't stop checking Slack. Burnout isn't just a risk—it's baked in.

It doesn't have to be. Imagine a world where your systems are resilient enough that you don't have to be. You finish early enough to catch the sunset. You take a day off without stress. You join your family, fully present—not glued to dashboards or deployments. That's not a fantasy. That's what good HA/DR design enables.

High Availability (HA) and Disaster Recovery (DR) aren't just infrastructure concerns or executive metrics—they're your best tools for building peace of mind. When implemented well, they let you ship with confidence, bounce back from failure, and stop living like you're always on call.

This post breaks down the key patterns and trade-offs of HA/DR in cloud-native environments—especially in Azure—so you can design for resilience without sabotaging your life.

📖 HA/DR Foundations: Resilience in Two Acts

When your system goes down, you need to recover. That's Disaster Recovery (DR)—the plan for getting back to production after a disruption. But wouldn't it be better if you didn't go down in the first place?

That's where High Availability (HA) comes in. HA is about designing your system so it rarely goes down. You build in redundancy, isolate failures, and keep critical services running—even when individual components falter.

In simple terms:

HA minimizes disruption.
DR minimizes downtime after disruption.

You need both. Together, they form the backbone of resilient systems—systems that bend instead of break.

💡 Think of HA as real-time protection and DR as your safety net. The stronger each is, the more confidently you can move.

🧱 High Availability Principles

High availability is the best disaster recovery—because the best outage is the one that never happens.

But designing for uptime doesn't mean aiming for perfection. It means building systems that degrade gracefully instead of collapsing completely. It's about buying time, limiting damage, and giving your team room to fix things without waking you up at 2 AM.

Here are the core principles:

Architect for continuity. Your system should be built so that it rarely goes down in the first place.
Use bulkheads. Isolate failure domains. If one component breaks, it shouldn't take the whole platform down with it.
Assume failure. Every part of your system will fail eventually. Make recovery fast, repeatable, and testable.
Degrade instead of fail. A partially working system is far better than a total outage.
Buy time. If you can contain the blast and keep core functionality up, you'll have the space to find and fix root causes without panic.

💡 High availability isn't just about uptime. It's about maintaining control when things go wrong.
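To make "degrade instead of fail" concrete, here is a minimal sketch. It is only an illustration under assumptions: `fetch_recommendations`, `with_fallback`, and the default list are hypothetical names invented for this example, not part of any real service.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or a degraded default if the primary call fails."""
    try:
        return primary()
    except Exception:
        # Degrade: serve a reduced but usable response instead of surfacing an error.
        return fallback


def fetch_recommendations() -> List[str]:
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


# Hypothetical static content that is good enough while the dependency recovers.
DEFAULT_RECOMMENDATIONS = ["top-sellers"]

result = with_fallback(fetch_recommendations, DEFAULT_RECOMMENDATIONS)
print(result)
```

The page still renders, just with less personalization. That buys time to fix the real problem without turning one broken dependency into a full outage.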

🔁 Disaster Recovery Principles

If high availability is about staying up, disaster recovery is about getting back up—fast and safely—after things go wrong.

Disaster recovery (DR) is your safety net. It's what kicks in when availability fails, when the unexpected hits, or when you need to recover from corruption, deletion, or full-region outages. A good DR plan is the difference between a brief interruption and a resume-generating incident.

Two key metrics define how you recover:

RTO (Recovery Time Objective): How quickly must the system be restored?
Example: "We must be back online within 30 minutes."
RPO (Recovery Point Objective): How much data can we afford to lose?
Example: "We can only lose up to 5 minutes of data."

You don't get to pick these numbers in isolation—they come from business needs. But your architecture determines whether you can meet them.
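One way to keep RTO and RPO honest is to score every recovery drill against them. The sketch below assumes you can record three timestamps per drill; the class names and the sample timestamps are made up for illustration, and the targets reuse the 30-minute and 5-minute examples above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


@dataclass
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_recoverable_write: datetime  # newest data still present after the restore

    @property
    def recovery_time(self) -> timedelta:
        return self.service_restored - self.outage_start

    @property
    def data_loss_window(self) -> timedelta:
        return self.outage_start - self.last_recoverable_write


def meets_targets(result: DrillResult, targets: RecoveryTargets) -> dict:
    """Compare what the drill actually achieved with the agreed objectives."""
    return {
        "rto_met": result.recovery_time <= targets.rto,
        "rpo_met": result.data_loss_window <= targets.rpo,
    }


targets = RecoveryTargets(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))

# Sample drill data (invented): restored in 22 minutes, but 8 minutes of writes lost.
drill = DrillResult(
    outage_start=datetime(2025, 1, 1, 12, 0),
    service_restored=datetime(2025, 1, 1, 12, 22),
    last_recoverable_write=datetime(2025, 1, 1, 11, 52),
)

report = meets_targets(drill, targets)
print(report)
```

In this sample the drill meets the RTO but misses the RPO, which is exactly the kind of gap you want to find in a drill rather than in an incident.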

Core DR principles:

Define realistic RTO and RPO targets. Don't guess—partner with stakeholders to understand expectations and constraints.
Automate recovery as much as possible. Manual steps add time and introduce errors under pressure.
Test your DR plan regularly. If you haven't run a recovery drill, you don't have a recovery plan—you have a document.
Keep dependencies in mind. Recovery isn't just about data—it's about DNS, identity, networking, and service interconnects.
Document and communicate. Everyone should know what to do—and what not to do—when disaster strikes.

💡 Disaster recovery isn't about avoiding failure—it's about owning it, containing it, and recovering with confidence.

🔁 Disaster Recovery Patterns: How Hot is Hot?

You've got the principles—now let's talk about what they actually look like in the real world.

In Azure (and most cloud environments), HA/DR strategies aren't just theoretical—they show up in concrete architectures. Whether you're dealing with a global SaaS app or an internal line-of-business tool, the patterns you choose will shape your system's resilience, cost, and complexity.

Let's break down the most common options, and I'll tell you which one I like best.

These are disaster recovery strategies—not high availability patterns—and that distinction matters. HA keeps your system running through localized failures. DR brings it back after major disruptions—like full-region outages or data corruption.

In Azure (and even on-premises), most DR strategies fall into one of three categories, based on how "hot" your standby environment is:

🔥 Hot/Hot

What it is: Both regions actively serve production traffic.
Recovery: Instant. Traffic reroutes automatically with little or no downtime.
Trade-offs: Highest cost, requires real-time data replication, and careful design to avoid conflicts.
When to use: You have strict RTO/RPO requirements or can't afford any downtime.

✅ Best resilience, but you're paying for it every minute.

🔥❄️ Hot/Warm

What it is: One region handles production. A second is pre-provisioned, idle, and synced—but not serving traffic.
Recovery: Minutes. Failover typically involves updating DNS or a traffic manager profile—and possibly starting services that were paused to save cost.
Trade-offs: Lower cost than hot/hot, but still requires maintenance and validation of the passive region.
When to use: You want a balance of performance, resilience, and cost.

⚖️ The sweet spot for many enterprise workloads.

❄️❄️ Hot/Cold

What it is: Only the primary region is provisioned. The secondary environment is defined but not deployed.
Recovery: Hours or more. Failover involves standing up infrastructure and restoring from backup.
Trade-offs: Cheapest option, but highest risk and slowest recovery.
When to use: You have generous RTOs or DR is only required for compliance purposes.

🧊 Better than nothing—but know what you're signing up for.
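The choice between these three can be framed as "the cheapest strategy that still meets the RTO." The sketch below is a rough illustration of that framing. The recovery times are placeholder assumptions standing in for "instant," "minutes," and "hours or more"; replace them with numbers measured in your own drills.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    typical_recovery: timedelta  # placeholder estimate, not a measured value
    relative_cost: int  # 1 = cheapest, 3 = most expensive


STRATEGIES = [
    Strategy("Hot/Hot", timedelta(minutes=1), 3),
    Strategy("Hot/Warm", timedelta(minutes=30), 2),
    Strategy("Hot/Cold", timedelta(hours=8), 1),
]


def cheapest_strategy(rto: timedelta) -> Optional[Strategy]:
    """Pick the lowest-cost strategy whose assumed recovery time fits within the RTO."""
    candidates = [s for s in STRATEGIES if s.typical_recovery <= rto]
    if not candidates:
        return None  # no strategy in this table can meet the target
    return min(candidates, key=lambda s: s.relative_cost)


for rto in (timedelta(minutes=5), timedelta(hours=1), timedelta(days=1)):
    choice = cheapest_strategy(rto)
    print(rto, "->", choice.name if choice else "none")
```

This deliberately ignores RPO, compliance, and team capacity. It only shows how a tighter RTO pushes you toward a hotter standby.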

🧭 High Availability Topologies: Staying Online by Design

Not every failure is a disaster. Most of the time, staying available is about surviving smaller disruptions—like a node crash, a zone outage, or a spike in demand. That's where high availability (HA) patterns come in.

Many Azure services include built-in HA by default:

App Service Plans can span multiple Availability Zones with three or more instances.
Storage Accounts offer Locally Redundant Storage (LRS) and Zone-Redundant Storage (ZRS) to keep your data safe even when a rack or zone fails.

But when your architecture must remain available across a wider blast radius—or serve traffic across geographies—you're designing for HA at scale. Below are the core patterns:

🔄 Active/Active

What it is: Two or more regions serve live traffic simultaneously.
Benefits: Resilient, scalable, and efficient. Traffic does not have to be split evenly; routing even 5–10% to a secondary region helps validate its readiness.
Prerequisite: Requires a Hot/Hot disaster recovery setup underneath.
When to use: You want maximum availability and live validation of multi-region readiness.

🚀 Every region earns its keep. This is resilience in action.
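An uneven split is easy to picture as weighted routing. In practice a managed global router would do this for you; the sketch below only illustrates the idea. The region names and the 90/10 weights are assumptions chosen for the example.

```python
import hashlib
from typing import Dict


def pick_region(request_key: str, weights: Dict[str, int]) -> str:
    """Deterministically map a request key to a region in proportion to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one region must have a positive weight")
    # Hash the key so the same caller consistently lands in the same region.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for region, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return region
    raise AssertionError("unreachable: bucket is always below the total weight")


# Illustrative weights: roughly 10% of traffic exercises the secondary region.
weights = {"eastus": 90, "westus": 10}
counts = {region: 0 for region in weights}
for i in range(10_000):
    counts[pick_region(f"user-{i}", weights)] += 1
print(counts)

# Failover is just a weight change: drain the primary and everything shifts.
print(pick_region("user-1", {"eastus": 0, "westus": 10}))
```

Because the secondary region serves real requests every day, a failover becomes a change in weights rather than a leap of faith.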

💤 Active/Passive

What it is: One region handles all traffic while another remains on standby.
Benefits: Simpler to operate, lower cost than active/active. Can still meet strict SLAs if DR failover is well-tested.
Watch out: Passive regions can silently drift out of date. DR drills are essential.
When to use: You need regional redundancy but can tolerate brief failover time.

🛑 Don't sleep on your passive region—test it or regret it.
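Silent drift is something you can check for between drills. Here is a minimal sketch, assuming you can export a component-to-version inventory for each region. The inventories, component names, and version strings below are hypothetical.

```python
from typing import Dict, List


def find_drift(active: Dict[str, str], passive: Dict[str, str]) -> List[str]:
    """List every component whose deployed version differs between the two regions."""
    findings = []
    for component in sorted(set(active) | set(passive)):
        active_version = active.get(component)
        passive_version = passive.get(component)
        if active_version is None:
            findings.append(f"{component}: only in passive ({passive_version})")
        elif passive_version is None:
            findings.append(f"{component}: missing from passive")
        elif active_version != passive_version:
            findings.append(
                f"{component}: active={active_version} passive={passive_version}"
            )
    return findings


# Hypothetical inventories exported from each region.
active_region = {"api": "1.4.2", "worker": "2.0.0", "gateway": "3.1.0"}
passive_region = {"api": "1.4.2", "worker": "1.9.0"}

for finding in find_drift(active_region, passive_region):
    print(finding)
```

A check like this does not replace a failover drill, but it can flag the obvious gaps before one.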

Topology	Description	Recovery Time	Cost
Active/Active	Traffic split across regions	Seconds	$$$
Active/Passive (Warm)	Standby is provisioned and synced	Minutes	$$
Active/Passive (Cold)	Standby is defined but not deployed	Hours+	$

🔗 HA/DR Combinations: Which Combo Solves the Most Pain?

When you combine High Availability topologies with Disaster Recovery strategies, you get real-world deployment patterns. These combinations are where resilience, cost, and complexity converge.

✅ Active/Active with Hot/Hot

Active regions are "hot" by definition—each one processes production traffic daily.
This is the gold standard for resilience: real traffic provides real validation.
Recovery is fast because traffic can be shifted quickly using Azure Front Door, DNS-based routing, or a global load balancer. Keep in mind that DNS-based failover is bounded by record TTLs.
You can optimize cost by unevenly distributing load or scaling regions independently.

💡 No surprise failovers. Every region proves it works—every day.
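The resilience gain is easy to estimate with some back-of-the-envelope arithmetic. The sketch below assumes regions fail independently and that either region alone can serve traffic. Both assumptions are optimistic in practice, because a shared dependency such as the global router or identity provider can take both regions down together. The 99.9% figure is illustrative, not a quoted SLA.

```python
from math import prod
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(region_availabilities: Iterable[float]) -> float:
    """Availability when any single region can serve traffic, assuming independent failures."""
    # The system is down only when every region is down at the same time.
    return 1 - prod(1 - a for a in region_availabilities)


def expected_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.999  # illustrative per-region availability
paired = combined_availability([single, single])

print(f"one region:  {expected_downtime_minutes(single):.1f} min/year")
print(f"two regions: {expected_downtime_minutes(paired):.2f} min/year")
```

Treat the result as an upper bound on the benefit, since real-world failures are rarely fully independent.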

💤 Active/Passive Combinations

With Hot/Hot: Technically possible, but often cost-inefficient: you're running two fully loaded regions, but only one serves users. You might use this if your architecture is stateful and can't yet support true Active/Active, though setting affinity at your global load balancer may be a better long-term solution. Fast failover, simple recovery, but high cost.

With Hot/Warm: A cost compromise—less expensive than Hot/Hot, but slower failover and more recovery complexity. Requires testing. Works for most teams.

With Hot/Cold: The cheapest option, but the slowest to recover—and the most likely to surprise you. Requires thorough testing. High risk if neglected.

🏆 My Recommendation: Active/Active with Hot/Hot

HA/DR Combinations

DR Strategy	Active/Active	Active/Passive
Hot/Hot	👑 Best resilience; live validation daily	😑 Wasted capacity; simple failover; 💰 high cost
Hot/Warm	N/A	👍 Good enough; slower failover; 🔁 requires drills
Hot/Cold	N/A	💰 Cheapest; manual recovery; 😩 risky if neglected

Of all the combinations, Active/Active with Hot/Hot provides the highest level of resilience—and the most peace of mind.

When both regions handle live traffic (even unevenly), you're constantly validating that failover works. There's no guesswork, no drift, and no emergency scramble. You get elastic scale, fast recovery, and the confidence to take a day off without watching the dashboard.

✨ It's not just the most resilient option—it's the one that lets you sleep at night.

Yes, I understand this recommendation is a bit selfish: I'm optimizing for personal peace of mind at the greatest expense. Business needs may override that. But to quote Ferris Bueller:

"It is so choice. If you have the means, I highly recommend [Active/Active with Hot/Hot]."

❗ Objections: Why Not Be Hot?

Let's face it—when you recommend Active/Active with Hot/Hot, you're bound to get pushback. It sounds expensive, complicated, and like something only big tech companies can afford. But most of those objections don't hold up under scrutiny.

And really... why not be hot?

💸 Objection: "Active/Active with Hot/Hot is too expensive!"

Sure, on paper it looks pricey—two regions, duplicated resources, twice the infrastructure. But here's the thing:

In the cloud, you're not paying for hardware—you're paying for capacity.
Each region should be scaled to handle normal load; you don't need peak capacity provisioned in both places.
Shared resources (like firewalls) are duplicated in every strategy except Hot/Cold.
In Active/Active, both regions earn their keep by processing production traffic.

💡 It's not waste if it's working.

🧠 Objection: "It's too complex!"

Managing two live regions sounds hard—until you realize much of the heavy lifting is already done for you:

Many Azure services offer built-in geo-replication and zone redundancy.
Infrastructure as code (IaC) makes multi-region deployment repeatable and testable.
Automation, templates, and observability tooling reduce much of the risk.

Plus, the payoff isn't just uptime—it's peace of mind. That's worth more than you think.

🧰 You pay Azure to simplify this complexity. Let it.

🔧 Objection: "We'd have to change the app!"

Possibly. But let's be real: if your app can't handle another region, it probably isn't handling this one very well either.

Your move to the cloud was supposed to improve elasticity and reduce CapEx.
Any app built for scalability should adapt to a second region with minimal changes.
If you're not willing to modernize the app, you're undercutting the whole value of cloud adoption.

🔁 This is a scalability issue disguised as an HA/DR objection.

Objections are natural—but they're not a reason to settle for fragile systems. If anything, they're an invitation to start a broader conversation. HA/DR isn't a one-person decision, and it isn't just a platform concern. It's a shared responsibility—one that crosses teams, roles, and org charts.

🤝 Make HA/DR a Shared Responsibility

Don't wait for someone else to own this. As a developer, you're not just writing features—you're building systems. And systems need to be resilient by design.

Your organization likely already has expectations around uptime, recovery time, and continuity. If your solution doesn't meet them, you may find yourself back at the drawing board—after the fire drill.

Start upstream. Partner with infrastructure and security teams early to understand the technical constraints.
Go beyond user stories. Talk to business stakeholders about RTO/RPO goals and the true cost of downtime.
If your team has no standards—create them. Recommend something. Personally, I like Active/Active with Hot/Hot for its clarity and resilience.
Don't skip the dry runs. Test failover scenarios before they become your next incident.

💡 HA/DR is too important to be someone else's problem. Build it into how you think.

🧘 Architect for Peace of Mind

Don't waste another minute away from the things that truly matter.

Yes, there are strong business cases for HA/DR—compliance, availability targets, reputational risk—but the most important reason is personal.

Your company can't tolerate downtime.
It's your responsibility to bring the system back up.
You want to keep your job to provide for your family.
So you stay online. You cancel the trip. You miss the recital—again.

That's a responsible decision... in the short term. But over time, it's a recipe for burnout. You don't need to choose between reliability and your life.

With the right strategy—planned, tested, and embedded in your architecture—you can walk away when you need to. You can trust that the system will hold.

💡 Architecting for uptime is architecting for peace of mind.

You don't need to build a perfect system—just one that fails gracefully, recovers fast, and lets you live your life.

✅ Key Takeaways

Failure is inevitable—design for resilience, not perfection.
Use Azure-native features like Front Door, Availability Zones, and paired regions to improve both HA and DR.
Prefer Active/Active with Hot/Hot when possible—it provides the fastest recovery and the greatest peace of mind.
Test your recovery process regularly. "It should work" ≠ "It will work."
HA/DR isn't just a technical choice; it's a quality-of-life investment.

📬 Want Help?

If you're trying to make your system more resilient—or just want to stop losing sleep—I'd love to talk.

Let's connect on LinkedIn »

Originally published on jamesrcounts.com.

The Problem: The `name` Property Is Too Flexible