🎬 The Scene: It's Monday Morning...
You open the Azure portal. There are 47 resource groups. Nobody knows who created 23 of them. There's a VM called test-final-v2-REAL-final running since 2024. Someone deployed a $800/month App Gateway for a dev environment. The tagging strategy? What tagging strategy?
Sound familiar?
Welcome to Azure Cloud Architecture Therapy — where we turn your chaotic cloud into something a Principal Engineer would be proud of. Grab coffee. This is going to be fun.
🏗️ First: How Azure Actually Works (The 2-Minute Version)
Before we fix anything, let's understand the plumbing. Every single thing you do in Azure — whether you're clicking buttons in the portal or running terraform apply — goes through one gateway:
You → Azure Resource Manager (ARM) → The Actual Resource
ARM is the bouncer at the club. It checks:
- Who are you? (Authentication via Entra ID)
- Can you do this? (Authorization via RBAC)
- Should we let this through? (Policies & throttle limits)
- OK, forwarding to the bartender (Resource Provider)
🚨 Real-World Disaster #1: ARM Throttling
The Error:
Status=429 Code="TooManyRequests"
Message="The request was throttled. Retry after 37 seconds"
What Happened: A team ran terraform plan on a monolithic root module with 2,000+ resources. ARM limits you to 12,000 read requests/hour and 1,200 write requests/hour per subscription. Their plan consumed the entire read budget, blocking other teams' deployments.
The Fix:
- Split infrastructure across multiple subscriptions (not just resource groups)
- Break that mega Terraform root module into smaller state files
- Use `terraform plan -parallelism=5` instead of the default 10
- Schedule pipeline runs to avoid peak hours
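Retries don't have to be manual, either. Here's a minimal bash sketch of exponential backoff around a throttled call — the `retry_with_backoff` wrapper and the delay values are my own illustration, not an az CLI feature (recent az versions also do some retrying on their own):

```shell
# Sketch: retry a command with exponential backoff when it fails (e.g. ARM 429s)
retry_with_backoff() {
  local max_attempts=$1; shift
  local attempt=1 delay=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"            # in real use, honor ARM's Retry-After header instead
    delay=$((delay * 2))      # back off: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Example: wrap a read-heavy call that may blow the 429 budget
# retry_with_backoff 5 az group list -o table
```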
💡 Principal Insight: ARM throttling is the #1 reason to adopt a multi-subscription strategy. If you think "we'll just use one subscription" — you haven't hit scale yet.
🗂️ Organizing Your Azure: The Management Group Hierarchy
Think of Azure organization like a company org chart, except everyone actually follows it (unlike real company org charts):
Tenant Root Group (The CEO nobody talks to)
├── Platform (The boring-but-essential stuff)
│ ├── Identity Subscription (AD DS, DNS, PKI)
│ ├── Management Subscription (Log Analytics, Monitoring)
│ └── Connectivity Subscription (Hub Network, Firewall, VPN)
│
├── Landing Zones (Where the real work happens)
│ ├── Corp (Internal apps — no internet exposure)
│ │ ├── team-alpha-subscription
│ │ └── team-bravo-subscription
│ └── Online (Internet-facing apps)
│ ├── public-web-app-subscription
│ └── api-platform-subscription
│
├── Sandbox (The "break stuff here" zone)
│ └── dev-playground-subscription
│
└── Decommissioned (The graveyard. RIP test-final-v2.)
└── old-projects-subscription
Which Subscription Pattern Should You Use?
| Pattern | Best For | Gotcha |
|---|---|---|
| App-per-subscription | Large orgs, strict isolation | Too many subscriptions to manage without automation |
| Environment-per-sub | Medium orgs | Apps from 15 teams sharing a "prod" subscription = chaos |
| Team-per-subscription | Autonomy-focused orgs | Cross-team app dependencies get messy |
| Workload-per-subscription | CAF recommended | Requires solid IaC automation |
🚨 Real-World Disaster #2: The "One Subscription to Rule Them All"
What Happened: A fintech startup put everything — dev, staging, prod, the CEO's demo environment — into one subscription. An intern with Contributor role on the subscription accidentally deleted the production resource group.
Yes, the production resource group. On a Tuesday.
The Fix:
- Separate subscriptions for prod vs. non-prod (at minimum)
- Azure Resource Locks on production resource groups:
az lock create --name "CannotDelete" \
--lock-type CanNotDelete \
--resource-group rg-payments-prod-eastus
- PIM (Privileged Identity Management) for elevated access — no one gets permanent Owner
- Delete locks + RBAC deny assignments for dangerous operations
🏷️ Naming & Tagging: The Unsexy Topic That Saves Your Career
I know, I know. Naming conventions. Exciting as watching paint dry. But here's the thing — when it's 2 AM and you're debugging a production issue, the difference between rg-payments-prod-eastus-001 and myResourceGroup7 is the difference between finding the problem and updating your LinkedIn.
The Naming Pattern
{resource-type}-{workload}-{environment}-{region}-{instance}
Examples:
rg-payments-prod-eastus-001 ← I know exactly what this is
aks-payments-prod-eastus-001 ← AKS cluster for payments, prod
kv-payments-prod-eastus-001 ← Key Vault
stpaymentsprodeastus001 ← Storage (no hyphens allowed, thanks Azure 🙄)
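If your deployments are scripted, you can generate names from the pattern instead of hand-typing them (and handle the storage-account exception once, centrally). A small bash sketch — `make_name` is a hypothetical helper, not an Azure tool:

```shell
# Sketch: build resource names from {type}-{workload}-{env}-{region}-{instance}
make_name() {
  local type=$1 workload=$2 env=$3 region=$4 instance=$5
  local name="${type}-${workload}-${env}-${region}-${instance}"
  if [ "$type" = "st" ]; then
    # Storage account names: no hyphens, max 24 characters
    name=$(printf '%s' "$name" | tr -d '-' | cut -c1-24)
  fi
  printf '%s\n' "$name"
}

make_name rg payments prod eastus 001   # rg-payments-prod-eastus-001
make_name st payments prod eastus 001   # stpaymentsprodeastus001
```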
Mandatory Tags (Enforce With Azure Policy)
| Tag | Why You Need It At 3 AM |
|---|---|
| `environment` | "Is this prod or dev?" — crucial before you `kubectl delete` |
| `owner` | "Who do I page?" |
| `cost-center` | "Who's paying for this $3,000/month GPU VM?" |
| `application` | "Which app does this belong to?" |
| `data-classification` | "Can I share this log with the vendor?" |
| `created-by` | "Did Terraform create this or did someone ClickOps it?" |
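Azure Policy is the real enforcement gate, but a cheap pre-flight check in your deploy script catches missing tags before ARM ever sees the request. A bash sketch — `check_tags` and the `key=value` tag string format are my own illustration, not an Azure feature:

```shell
# Sketch: verify a "key=value key=value" tag string contains every mandatory tag
required_tags="environment owner cost-center application data-classification created-by"

check_tags() {
  local tags=" $1 " missing=0
  local t
  for t in $required_tags; do
    case "$tags" in
      *" $t="*) ;;                        # tag present
      *) echo "missing tag: $t"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Example:
# check_tags "environment=prod owner=team-alpha" || echo "deployment blocked"
```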
🚨 Real-World Disaster #3: The $47,000 Mystery Bill
The Situation: Finance escalates a ticket: Azure spend jumped $47K in one month. Nobody knows why.
Root Cause: A performance test spun up 50 Standard_E64s_v5 VMs (64 vCPU, 512 GB RAM each) with no auto-shutdown and no cost tags. The test ran on a Friday. Nobody noticed until billing closed.
The Fix:
- Azure Policy to deny resource creation without required tags
- Cost anomaly alerts at subscription and resource group level
- Auto-shutdown policy for dev/test VMs
- Tag-based cost reporting in Azure Cost Management
// Azure Policy: Require 'cost-center' tag
{
  "if": {
    "field": "tags['cost-center']",
    "exists": "false"
  },
  "then": {
    "effect": "deny"
  }
}
🌐 Networking: Where Dreams Go to Die
Azure networking is where even senior engineers start sweating. Let's make it simple.
Hub-Spoke: The Pattern You'll Use 90% of the Time
The Internet
│
┌──────▼──────┐
│ Hub VNet │ ← Firewall, VPN/ExpressRoute, DNS
└──────┬───────┘
│
┌───────┼───────┐
▼ ▼ ▼
Spoke 1 Spoke 2 Spoke 3
(App A) (App B) (Shared)
The Hub = Your security checkpoint. All traffic flows through here.
Spokes = Where your applications live. Isolated from each other.
The Zero-Trust Commandments
- NO public endpoints on backend services. Period.
- Private Endpoints for every PaaS service (SQL, Key Vault, Storage, ACR)
- Service endpoints are the poor man's Private Endpoints — use them only when budget is truly tight
- All traffic stays on the Microsoft backbone network
🚨 Real-World Disaster #4: The "Publicly Exposed SQL Server"
The Alert:
Microsoft Defender for Cloud: CRITICAL
"Azure SQL Server has public network access enabled"
"3,847 failed login attempts from IP: 185.x.x.x in the last hour"
What Happened: A developer enabled "Allow Azure services" on an Azure SQL Server "just for testing" and never turned it off. This essentially opens your SQL to any Azure IP — including attacker VMs running in Azure.
The Fix:
# Disable public access
az sql server update --name sql-prod --resource-group rg-app \
--public-network-access Disabled
# Use Private Endpoint instead
az network private-endpoint create \
--name pe-sql-prod \
--resource-group rg-app \
--vnet-name vnet-spoke-app \
--subnet snet-data \
--private-connection-resource-id /subscriptions/.../sql-prod \
--group-id sqlServer \
--connection-name sql-private-connection
DNS with Private Endpoints (The Part Everyone Gets Wrong)
When you create a Private Endpoint, you need DNS to resolve the service name to the private IP, not the public IP. This trips up EVERYONE.
What should happen:
sql-prod.database.windows.net
→ CNAME → sql-prod.privatelink.database.windows.net
→ A record → 10.0.5.4 (Private IP in your VNet)
What goes wrong:
"I created the Private Endpoint but my app still connects to the public IP!"
→ You forgot to create the Private DNS Zone and link it to your VNet
The checklist:
- Create Private Endpoint ✅
- Create Private DNS Zone (e.g., `privatelink.database.windows.net`) ✅
- Link DNS Zone to your Hub VNet (and spoke VNets) ✅
- DNS records auto-populate ✅
- Test from inside the VNet: `nslookup sql-prod.database.windows.net` ✅
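You can automate that last step instead of eyeballing nslookup output: assert the name resolves to an RFC 1918 (private) address. A bash sketch — both helper functions are illustrative, and `resolved_ip` assumes `dig` is installed:

```shell
# Sketch: does a hostname resolve to a private IP from where we're standing?
resolved_ip() { dig +short "$1" | tail -n1; }

is_private_ip() {
  case $1 in
    10.*|192.168.*)                        return 0 ;;   # 10/8, 192.168/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;   # 172.16/12
    *)                                     return 1 ;;
  esac
}

# Example (run from inside the VNet):
# is_private_ip "$(resolved_ip sql-prod.database.windows.net)" \
#   && echo "Private Endpoint working" || echo "still resolving to public IP"
```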
🔐 Identity: Stop Using Passwords. Like, Yesterday.
This is 2026. If your applications are still connecting to Azure resources with connection strings that have passwords in them, we need to have a serious conversation.
The Identity Hierarchy
🏆 Tier 1: Managed Identity (BEST — no credentials at all)
App → Azure Resource, zero secrets involved
🥈 Tier 2: Workload Identity Federation (K8s pods → Azure)
Pod → Federated Token → Azure Resource
🥉 Tier 3: OIDC Federation (CI/CD → Azure)
Pipeline → Short-lived token → Azure Resource
💀 Tier Last: Service Principal + Client Secret
"We rotated the secret and broke prod at 4 AM"
🚨 Real-World Disaster #5: The Expired Service Principal
The 3 AM PagerDuty Alert:
CRITICAL: Deployment pipeline failed
Error: AADSTS7000222: The provided client secret keys for app
'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' are expired.
What Happened: A service principal secret was set to expire in 6 months. Nobody set up a reminder. 6 months passed. Production deployment pipeline stopped working. Release blocked for 4 hours while someone figured out how to rotate the secret without breaking other services using it.
The Fix: Stop using client secrets entirely.
# For pipelines: Use OIDC federation (no secrets!)
az ad app federated-credential create \
--id <app-object-id> \
--parameters '{
"name": "github-main-branch",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:myorg/myrepo:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}'
# For Azure resources: Use Managed Identity
az webapp identity assign --name myapp --resource-group rg-prod
🧮 Choosing Your Compute Platform
Every week someone asks: "Should we use AKS or App Service?" Here's the cheat sheet:
| Need | Use This | Why |
|---|---|---|
| "We have microservices and K8s expertise" | AKS | Full control, service mesh, custom operators |
| "Simple web app, REST API" | App Service | Managed, easy, cost-effective |
| "Containers but no K8s pls" | Container Apps | Serverless containers, KEDA built-in |
| "Event-driven, sporadic traffic" | Azure Functions | Scale-to-zero, pay-per-execution |
| "We need GPUs" | AKS (GPU node pools) | Only K8s gives you GPU scheduling flexibility |
| "Legacy .NET app" | App Service | Or containerize it for Container Apps |
🚨 Real-World Disaster #6: The Over-Engineered Startup
The Situation: A 4-person startup with one API and one frontend deployed to a 3-node AKS cluster with Istio service mesh, Prometheus, Grafana, Kyverno, and ArgoCD. Monthly cloud bill: $2,800. Total users: 47.
The Fix: Migrated to Azure Container Apps. Monthly bill: $12.
💡 Principal Insight: The right tool depends on your actual needs, not your resume aspirations. AKS is the right call when you have the scale and team to justify it. For everything else, there are simpler options.
💰 FinOps: Because Money Is a Feature
Cloud cost isn't someone else's problem. At the Principal level, cost optimization is part of your architecture decisions.
Quick Wins
| Action | Typical Savings |
|---|---|
| Right-size VMs (Azure Advisor recommendations) | 20-40% |
| Reserved Instances (1-3 year commit) | 30-72% |
| Spot VMs for batch/test workloads | 60-90% |
| Auto-shutdown for dev/test | 40-60% |
| Storage lifecycle policies (hot → cool → archive) | 50-80% on storage |
| Delete orphaned disks, IPs, load balancers | Immediate savings |
The FinOps Command You Should Run Right Now
# Find orphaned resources (unattached disks, unused public IPs)
az disk list --query "[?managedBy==null].{Name:name, Size:diskSizeGb, RG:resourceGroup}" -o table
az network public-ip list --query "[?ipConfiguration==null].{Name:name, RG:resourceGroup}" -o table
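To turn that query into a single number for the finance ticket, pipe the disk sizes through awk. A sketch — the `printf` sample stands in for the real az output:

```shell
# In real use, replace the printf with:
#   az disk list --query "[?managedBy==null].diskSizeGb" -o tsv
printf '128\n256\n512\n' | awk '{total += $1} END {print total " GB of orphaned disk"}'
# → 896 GB of orphaned disk
```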
I guarantee you'll find at least 3 orphaned disks you're paying for right now. Go check. I'll wait. ☕
🎯 Key Takeaways
- ARM throttling is real — design for multi-subscription from the start
- Management groups + Landing Zones = the foundation of enterprise Azure
- Tag everything or drown in mystery costs
- Private Endpoints everywhere — no public backends, no exceptions
- Managed Identity > Workload Identity > OIDC > ... > secrets (secrets are the worst)
- Pick the right compute — don't bring AKS to a Container Apps fight
- FinOps is architecture — cost is a first-class design requirement
🔥 Homework
- Run the orphaned disk command above. Screenshot the results (I dare you to have zero).
- Check if ANY of your production SQL databases have public network access. Fix them.
- Find one service principal with an expired or expiring secret. Replace it with Managed Identity or OIDC.
Next up in the series: *Kubernetes: The Drama of Pods, Nodes, and the Scheduler Who Hates Everyone* — where we decode K8s internals, real production meltdowns, and why your pod keeps getting OOMKilled at 2 AM.
💬 Drop a comment if you've survived any of these disasters. Bonus points if your war story is worse. (I know it is.)