Muskan

Tag Governance at Scale: How to Build a Cloud Tagging Strategy That Actually Sticks

The $231 Billion Visibility Problem

Global cloud spending hit $723 billion in 2025, and by Flexera's survey estimates organizations waste roughly 30% of it year after year — at the 2021 peak of 32%, that is about $231 billion at 2025 spend levels. The waste rate has barely moved in six years: 30% in 2019, 32% in 2021, 27% in 2025. Despite better tooling, better awareness, and more FinOps practitioners than ever, it is essentially flat.

| Year | Cloud Waste Rate (Flexera) |
|------|----------------------------|
| 2019 | 30% |
| 2020 | 30% |
| 2021 | 32% |
| 2022 | 28% |
| 2023 | 28% |
| 2024 | 27% |
| 2025 | 27% |

This isn't a tooling problem. It's a visibility problem. When more than 20% of cloud spend lacks tags, cost identification breaks down because you can't attribute charges to teams, workloads, or products. You see a line item in a billing report, but no owner, no environment, no business context. You can't optimize what you can't identify.

The Drift case is instructive. Their DevOps team built manual tagging processes on AWS to track spend. It wasn't enough. Costs from machine learning features were unpredictable, attribution was impossible, and the team ended up spending roughly 80% of their time on cost reduction efforts, with no clear picture of which products or features were driving the bill. The issue wasn't effort. It was that inconsistent tagging had made the data structurally untrustworthy.

This is the starting condition for most engineering organizations past 50 people. And the solution isn't more tagging. It's a governance system that makes tagging the path of least resistance.

Why Tagging Strategies Die in Year One

Most tagging initiatives fail the same way. The platform team ships a policy document with 15 required tags. Engineers read it once, ignore it under deadline pressure, and provision resources the way they always have. Six months later, coverage is at 40%, the policy is informally abandoned, and the next initiative starts from scratch.

The root causes are consistent:

Too many required fields. When the schema starts with 20 mandatory tags, teams route around the complexity. Coverage drops because compliance feels impossible. A practical starting point is 5-8 mandatory tags. Everything else is optional.

Manual discipline doesn't scale. Engineers under sprint pressure skip steps that don't block deploys. If you can create a resource without tagging it, most people eventually will. Policy documents are suggestions. Enforced gates are policy.

Templates predate the policy. Even in mature IaC environments, Terraform modules and CloudFormation templates often omit tag blocks because they were authored before the governance policy existed. A new policy doesn't automatically retrofit old infrastructure.

Runtime drift. Cloud providers themselves modify resources. Azure Policy's modify and deployIfNotExists effects can inject tags or change settings after creation. CI/CD pipelines can add or overwrite tag values. If those changes aren't reflected in the IaC source, the code and the live environment diverge silently.

Each of these failure modes is solvable. But they require a system, not a document.
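Runtime drift in particular can be caught mechanically: diff the tags declared in IaC against the tags on the live resource and flag anything that changed outside of code. A minimal sketch, with illustrative tag maps rather than any provider's actual API shapes:

```python
# Sketch: detect tag drift between IaC-declared tags and live resource tags.
# The dict shapes here are illustrative, not a real provider API.

def diff_tags(declared: dict, live: dict) -> dict:
    """Return tags that were added, removed, or changed outside of IaC."""
    added = {k: v for k, v in live.items() if k not in declared}
    removed = {k: v for k, v in declared.items() if k not in live}
    changed = {
        k: {"declared": declared[k], "live": live[k]}
        for k in declared.keys() & live.keys()
        if declared[k] != live[k]
    }
    return {"added": added, "removed": removed, "changed": changed}

declared = {"environment": "prod", "owner": "platform-team"}
live = {"environment": "prod", "owner": "alice", "patched_by": "azure-policy"}

drift = diff_tags(declared, live)
```

Run on a schedule, a check like this turns silent divergence into a reviewable report — the "added" bucket is exactly where policy-injected and pipeline-injected tags show up.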

Design the Schema Before You Touch a Policy

The tagging schema is the foundation. Get it wrong and enforcement will either be too restrictive to adopt or too loose to be useful.

Start with a small mandatory set. Every resource across every cloud account should carry these keys from day one:

| Tag Key | Purpose / Values | Unlocks |
|---------|------------------|---------|
| `environment` | prod, staging, dev, qa | Cost separation, lifecycle automation |
| `owner` | Team or individual responsible | Incident routing, waste attribution |
| `cost_center` | Business unit or team ID | Chargeback, showback |
| `application` | Logical app or service name | Cross-resource cost rollups |
| `team` | Owning engineering team | Dashboard filtering, alerting |

Beyond these five, add optional tags for specific use cases: data_classification for security policy enforcement, backup_schedule for operations automation, project for temporary workloads with defined end dates.
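The mandatory set above is small enough that the core check fits in a few lines. A minimal sketch — the function name is illustrative, and `REQUIRED_TAGS` mirrors the five keys in the table:

```python
# Minimal mandatory-key check. REQUIRED_TAGS mirrors the five keys in the
# table above; the function name is an illustrative assumption.

REQUIRED_TAGS = {"environment", "owner", "cost_center", "application", "team"}

def missing_required(tags: dict) -> set:
    """Return the required tag keys absent from a resource's tag map."""
    return REQUIRED_TAGS - tags.keys()
```

The same predicate works in a pre-commit hook, a CI step, or a post-deployment scanner, which is part of why a small mandatory set is easier to keep enforced than a large one.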

A few design rules that matter at scale:

Keep keys provider-agnostic. Using awsAccountId as a tag key instead of owner bakes in a cloud-specific assumption. When your GCP or Azure footprint grows, cross-cloud reporting breaks. Neutral keys work everywhere.

Standardize values, not just keys. environment: Prod, environment: production, and environment: PRODUCTION are three different values in cost filters and policy rules. Define an allowed-values list for enumerable keys like environment and enforce it.

Avoid tag sprawl. Teams naturally accumulate tags over time: experiment tags, one-off project tags, tags added for a specific audit. Review the tag schema annually and remove keys with low adoption or no downstream use.
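Value standardization can also be automated at the edge of the system rather than policed by hand. A sketch of a normalizer for the environment key — the alias map and function name are illustrative assumptions, not a provider feature:

```python
# Sketch: collapse value variants to a canonical allowed list.
# The alias map is an illustrative assumption.

ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "qa"}
ALIASES = {"production": "prod", "prd": "prod", "development": "dev", "develop": "dev"}

def normalize_environment(value: str) -> str:
    """Lowercase, resolve known aliases, reject anything off-list."""
    v = ALIASES.get(value.strip().lower(), value.strip().lower())
    if v not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"environment {value!r} is not an allowed value")
    return v
```

Normalizing at write time keeps `environment: Prod` and `environment: PRODUCTION` from ever becoming three separate rows in a cost report.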

Enforcement: From Suggestion to System

A tagging schema without enforcement is a wishlist. Enforcement means resources cannot be created without required tags, and resources that drift out of compliance are detected and flagged automatically.

The enforcement stack differs by provider:

On AWS, Service Control Policies (SCPs) can deny ec2:RunInstances, s3:CreateBucket, and other provisioning API calls unless required tags are present in the request. AWS Organizations Tag Policies enforce allowed values and generate compliance reports at the account level. AWS Config rules run continuously and flag resources that are out of compliance post-deployment.

On Azure, Azure Policy with a deny effect blocks resource creation that violates tag requirements. The modify effect can automatically inherit tags from resource groups to child resources, which reduces the burden on individual provisioning steps.

On GCP, Organization Policy constraints provide less native tag enforcement compared to AWS and Azure. The practical approach is validation in CI/CD pipelines using tools like OPA (Open Policy Agent) combined with Terraform plan-stage checks. This catches violations before terraform apply runs.

IaC-first enforcement is more reliable than post-deployment scanning alone. If you only scan after deployment, you're remediating after the fact. Enforcement at the plan-and-deploy stage prevents untagged resources from entering the environment in the first place.
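A plan-stage gate can be as simple as scanning the JSON from `terraform show -json tfplan` for to-be-created resources that lack required tags. A sketch, assuming Terraform's JSON plan format — note the tag attribute name differs by provider ("tags" on AWS and Azure, "labels" on GCP), so this is a starting point rather than a drop-in check:

```python
# Sketch of a CI plan-stage gate: flag resources being created without
# required tags, using Terraform's JSON plan representation.
import json

REQUIRED_TAGS = {"environment", "owner", "cost_center"}

def untagged_creates(plan_json: str) -> list:
    """Return (address, missing_keys) for each new resource lacking tags."""
    plan = json.loads(plan_json)
    violations = []
    for rc in plan.get("resource_changes", []):
        if "create" not in rc.get("change", {}).get("actions", []):
            continue
        after = rc["change"].get("after") or {}
        missing = REQUIRED_TAGS - (after.get("tags") or {}).keys()
        if missing:
            violations.append((rc["address"], sorted(missing)))
    return violations
```

Failing the pipeline when this list is non-empty is what turns the policy document into an enforced gate.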

One important exception: existing resources created before the policy existed. These need a remediation path. Automated scanners (AWS Config, Azure Resource Graph, GCP Cloud Asset Inventory) can produce lists of non-compliant resources. Assign ownership of the remediation backlog to specific teams with a deadline, and track coverage weekly in a shared dashboard.

Rolling It Out Without Stopping Engineering

Enforcement without a rollout plan creates friction that poisons adoption. The goal is to reach 95% compliance without blocking a single production deploy.

Start in sandbox accounts. Enable enforcement, measure coverage, and let teams encounter the policy in a low-stakes environment. This surfaces edge cases: automation scripts that create resources without tags, third-party tools that don't support tag passthrough, and legacy modules that need updating.

Move to development in week two. By now, teams have seen the enforcement mechanism and know what's expected. Measure coverage weekly, not monthly. A coverage dashboard visible to both engineering teams and leadership changes behavior faster than policy reminders. When a team can see their environment has 62% coverage against an org average of 83%, that gap drives action.
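The dashboard metric behind that comparison is straightforward to compute from an inventory snapshot. A sketch, with illustrative field names:

```python
# Sketch of the dashboard metric: percent of each team's resources carrying
# every required tag. Field names are illustrative assumptions.

REQUIRED = {"environment", "owner", "cost_center", "application", "team"}

def coverage_by_team(resources: list) -> dict:
    """Map each team to its tag-coverage percentage."""
    totals, compliant = {}, {}
    for r in resources:
        tags = r.get("tags", {})
        team = tags.get("team", "(untagged)")
        totals[team] = totals.get(team, 0) + 1
        if REQUIRED <= tags.keys():
            compliant[team] = compliant.get(team, 0) + 1
    return {t: round(100 * compliant.get(t, 0) / n, 1) for t, n in totals.items()}
```

Resources with no team tag land in an explicit "(untagged)" bucket, which keeps the worst gap visible on the dashboard instead of hidden in a denominator.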

Before enabling enforcement in production, establish an exemption process. Engineers need a path to deploy a critical fix without being blocked by a tag validation error at 2 am. The exemption should require a ticket, a 48-hour window, and automatic remediation afterward. Without this escape valve, teams will route around enforcement rather than work with it.

What 95% Actually Looks Like (and Why 100% Is a Trap)

The last 5% of tag coverage is not a governance failure. It's a cloud billing reality.

Some cost categories don't support resource-level tagging. These include AWS data transfer charges, certain networking costs on GCP, and some legacy Azure classic resources. No matter how rigorous the tagging policy is, these costs will appear as untagged in cost exports.

The way to handle them is billing-level allocation, not resource tagging:

| Provider | Un-Taggable Cost Type | Allocation Tool |
|----------|----------------------|-----------------|
| AWS | Data transfer, support charges | AWS Cost Categories |
| Azure | Classic resources, tenant-level charges | Azure cost allocation rules |
| GCP | Networking, shared services | BigQuery billing export with custom allocation logic |

Define what 95% means for your organization before you start measuring. If 5% of your spend is structurally untaggable, then tagging 95% of total spend means 100% coverage of taggable resources — effectively full compliance. Chasing the last 5% with increasingly complex workarounds is a poor use of engineering time.
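The arithmetic is worth making explicit: coverage should be measured against taggable spend, not total spend. A sketch with illustrative figures:

```python
# Sketch of the arithmetic: measure coverage against taggable spend only.
# The example figures are illustrative.

def effective_coverage(total_spend: float, tagged_spend: float,
                       untaggable_spend: float) -> float:
    """Tag coverage as a percentage of spend that can actually be tagged."""
    taggable = total_spend - untaggable_spend
    return round(100 * tagged_spend / taggable, 1)
```

With $100k total spend, $95k tagged, and $5k structurally untaggable, coverage over taggable spend is 100%, even though 5% of the bill shows up as untagged in cost exports.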

When coverage reaches 90% and above, the downstream benefits compound. Cost attribution is reliable enough for accurate chargeback. Automation policies (shutting down dev environments overnight, deleting untagged resources after 30 days, enforcing backup schedules by environment) become trustworthy. Security teams can write access policies that reference tags rather than hardcoded resource ARNs or IDs.

Tagging done well isn't metadata hygiene. It's the operational substrate that makes everything else in cloud governance actually work.
