🎬 The Scene: It's Monday Morning...
You open the Azure portal. There are 47 resource groups. Nobody knows who created 23 of them. There's a VM called test-final-v2-REAL-final running since 2024. Someone deployed a $800/month App Gateway for a dev environment. The tagging strategy? What tagging strategy?
Sound familiar?
Welcome to Azure Cloud Architecture Therapy — where we turn your chaotic cloud into something a Principal Engineer would be proud of. Grab coffee. This is going to be fun.
🏗️ First: How Azure Actually Works (The 2-Minute Version)
Before we fix anything, let's understand the plumbing. Every single thing you do in Azure — whether you're clicking buttons in the portal or running terraform apply — goes through one gateway:
You → Azure Resource Manager (ARM) → The Actual Resource
ARM is the bouncer at the club. It checks:
- Who are you? (Authentication via Entra ID)
- Can you do this? (Authorization via RBAC)
- Should we let this through? (Policies & throttle limits)
- OK, forwarding to the bartender (Resource Provider)
🚨 Real-World Disaster #1: ARM Throttling
The Error:
Status=429 Code="TooManyRequests"
Message="The request was throttled. Retry after 37 seconds"
What Happened: A team ran terraform plan on a monolithic root module with 2,000+ resources. ARM limits you to 12,000 read requests/hour and 1,200 write requests/hour per subscription. Their plan consumed the entire read budget, blocking other teams' deployments.
The Fix:
- Split infrastructure across multiple subscriptions (not just resource groups)
- Break that mega Terraform root module into smaller state files
- Use `terraform plan -parallelism=5` instead of the default 10
- Schedule pipeline runs to avoid peak hours
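Retries don't have to be manual, either. Here's a minimal bash sketch of exponential backoff around a throttled call — the `retry_with_backoff` wrapper and the delay values are my own illustration, not an az CLI feature (recent az versions also do some retrying on their own):

```shell
# Sketch: retry a command with exponential backoff when it fails (e.g. ARM 429s)
retry_with_backoff() {
  local max_attempts=$1; shift
  local attempt=1 delay=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"            # in real use, honor ARM's Retry-After header instead
    delay=$((delay * 2))      # back off: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Example: wrap a read-heavy call that may blow the 429 budget
# retry_with_backoff 5 az group list -o table
```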
💡 Principal Insight: ARM throttling is the #1 reason to adopt a multi-subscription strategy. If you think "we'll just use one subscription" — you haven't hit scale yet.
🗂️ Organizing Your Azure: The Management Group Hierarchy
Think of Azure organization like a company org chart, except everyone actually follows it (unlike real company org charts):
Tenant Root Group (The CEO nobody talks to)
├── Platform (The boring-but-essential stuff)
│ ├── Identity Subscription (AD DS, DNS, PKI)
│ ├── Management Subscription (Log Analytics, Monitoring)
│ └── Connectivity Subscription (Hub Network, Firewall, VPN)
│
├── Landing Zones (Where the real work happens)
│ ├── Corp (Internal apps — no internet exposure)
│ │ ├── team-alpha-subscription
│ │ └── team-bravo-subscription
│ └── Online (Internet-facing apps)
│ ├── public-web-app-subscription
│ └── api-platform-subscription
│
├── Sandbox (The "break stuff here" zone)
│ └── dev-playground-subscription
│
└── Decommissioned (The graveyard. RIP test-final-v2.)
└── old-projects-subscription
Which Subscription Pattern Should You Use?
| Pattern | Best For | Gotcha |
|---|---|---|
| App-per-subscription | Large orgs, strict isolation | Too many subscriptions to manage without automation |
| Environment-per-sub | Medium orgs | Apps from 15 teams sharing a "prod" subscription = chaos |
| Team-per-subscription | Autonomy-focused orgs | Cross-team app dependencies get messy |
| Workload-per-subscription | CAF recommended | Requires solid IaC automation |
🚨 Real-World Disaster #2: The "One Subscription to Rule Them All"
What Happened: A fintech startup put everything — dev, staging, prod, the CEO's demo environment — into one subscription. An intern with Contributor role on the subscription accidentally deleted the production resource group.
Yes, the production resource group. On a Tuesday.
The Fix:
- Separate subscriptions for prod vs. non-prod (at minimum)
- Azure Resource Locks on production resource groups:
az lock create --name "CannotDelete" \
--lock-type CanNotDelete \
--resource-group rg-payments-prod-eastus
- PIM (Privileged Identity Management) for elevated access — no one gets permanent Owner
- Delete locks + RBAC deny assignments for dangerous operations
🏷️ Naming & Tagging: The Unsexy Topic That Saves Your Career
I know, I know. Naming conventions. Exciting as watching paint dry. But here's the thing — when it's 2 AM and you're debugging a production issue, the difference between rg-payments-prod-eastus-001 and myResourceGroup7 is the difference between finding the problem and updating your LinkedIn.
The Naming Pattern
{resource-type}-{workload}-{environment}-{region}-{instance}
Examples:
rg-payments-prod-eastus-001 ← I know exactly what this is
aks-payments-prod-eastus-001 ← AKS cluster for payments, prod
kv-payments-prod-eastus-001 ← Key Vault
stpaymentsprodeastus001 ← Storage (no hyphens allowed, thanks Azure 🙄)
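If your deployments are scripted, you can generate names from the pattern instead of hand-typing them (and handle the storage-account exception once, centrally). A small bash sketch — `make_name` is a hypothetical helper, not an Azure tool:

```shell
# Sketch: build resource names from {type}-{workload}-{env}-{region}-{instance}
make_name() {
  local type=$1 workload=$2 env=$3 region=$4 instance=$5
  local name="${type}-${workload}-${env}-${region}-${instance}"
  if [ "$type" = "st" ]; then
    # Storage account names: no hyphens, max 24 characters
    name=$(printf '%s' "$name" | tr -d '-' | cut -c1-24)
  fi
  printf '%s\n' "$name"
}

make_name rg payments prod eastus 001   # rg-payments-prod-eastus-001
make_name st payments prod eastus 001   # stpaymentsprodeastus001
```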
Mandatory Tags (Enforce With Azure Policy)
| Tag | Why You Need It At 3 AM |
|---|---|
| `environment` | "Is this prod or dev?" — crucial before you `kubectl delete` |
| `owner` | "Who do I page?" |
| `cost-center` | "Who's paying for this $3,000/month GPU VM?" |
| `application` | "Which app does this belong to?" |
| `data-classification` | "Can I share this log with the vendor?" |
| `created-by` | "Did Terraform create this or did someone ClickOps it?" |
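Azure Policy is the real enforcement gate, but a cheap pre-flight check in your deploy script catches missing tags before ARM ever sees the request. A bash sketch — `check_tags` and the `key=value` tag string format are my own illustration, not an Azure feature:

```shell
# Sketch: verify a "key=value key=value" tag string contains every mandatory tag
required_tags="environment owner cost-center application data-classification created-by"

check_tags() {
  local tags=" $1 " missing=0
  local t
  for t in $required_tags; do
    case "$tags" in
      *" $t="*) ;;                        # tag present
      *) echo "missing tag: $t"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Example:
# check_tags "environment=prod owner=team-alpha" || echo "deployment blocked"
```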
🚨 Real-World Disaster #3: The $47,000 Mystery Bill
The Situation: Finance escalates a ticket: Azure spend jumped $47K in one month. Nobody knows why.
Root Cause: A performance test spun up 50 Standard_E64s_v5 VMs (64 vCPU, 512 GB RAM each) with no auto-shutdown and no cost tags. The test ran on a Friday. Nobody noticed until billing closed.
The Fix:
- Azure Policy to deny resource creation without required tags
- Cost anomaly alerts at subscription and resource group level
- Auto-shutdown policy for dev/test VMs
- Tag-based cost reporting in Azure Cost Management
// Azure Policy: Require 'cost-center' tag
{
  "if": {
    "field": "tags['cost-center']",
    "exists": "false"
  },
  "then": {
    "effect": "deny"
  }
}
🌐 Networking: Where Dreams Go to Die
Azure networking is where even senior engineers start sweating. Let's make it simple.
Hub-Spoke: The Pattern You'll Use 90% of the Time
The Internet
│
┌──────▼──────┐
│ Hub VNet │ ← Firewall, VPN/ExpressRoute, DNS
└──────┬───────┘
│
┌───────┼───────┐
▼ ▼ ▼
Spoke 1 Spoke 2 Spoke 3
(App A) (App B) (Shared)
The Hub = Your security checkpoint. All traffic flows through here.
Spokes = Where your applications live. Isolated from each other.
The Zero-Trust Commandments
- NO public endpoints on backend services. Period.
- Private Endpoints for every PaaS service (SQL, Key Vault, Storage, ACR)
- Service endpoints are the poor man's Private Endpoints — use them only when budget is truly tight
- All traffic stays on the Microsoft backbone network
🚨 Real-World Disaster #4: The "Publicly Exposed SQL Server"
The Alert:
Microsoft Defender for Cloud: CRITICAL
"Azure SQL Server has public network access enabled"
"3,847 failed login attempts from IP: 185.x.x.x in the last hour"
What Happened: A developer enabled "Allow Azure services" on an Azure SQL Server "just for testing" and never turned it off. This essentially opens your SQL to any Azure IP — including attacker VMs running in Azure.
The Fix:
# Disable public access
az sql server update --name sql-prod --resource-group rg-app \
--public-network-access Disabled
# Use Private Endpoint instead
az network private-endpoint create \
--name pe-sql-prod \
--resource-group rg-app \
--vnet-name vnet-spoke-app \
--subnet snet-data \
--private-connection-resource-id /subscriptions/.../sql-prod \
--group-id sqlServer \
--connection-name sql-private-connection
DNS with Private Endpoints (The Part Everyone Gets Wrong)
When you create a Private Endpoint, you need DNS to resolve the service name to the private IP, not the public IP. This trips up EVERYONE.
What should happen:
sql-prod.database.windows.net
→ CNAME → sql-prod.privatelink.database.windows.net
→ A record → 10.0.5.4 (Private IP in your VNet)
What goes wrong:
"I created the Private Endpoint but my app still connects to the public IP!"
→ You forgot to create the Private DNS Zone and link it to your VNet
The checklist:
- Create Private Endpoint ✅
- Create Private DNS Zone (e.g., `privatelink.database.windows.net`) ✅
- Link DNS Zone to your Hub VNet (and spoke VNets) ✅
- DNS records auto-populate ✅
- Test from inside the VNet: `nslookup sql-prod.database.windows.net` ✅
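You can automate that last step instead of eyeballing nslookup output: assert the name resolves to an RFC 1918 (private) address. A bash sketch — both helper functions are illustrative, and `resolved_ip` assumes `dig` is installed:

```shell
# Sketch: does a hostname resolve to a private IP from where we're standing?
resolved_ip() { dig +short "$1" | tail -n1; }

is_private_ip() {
  case $1 in
    10.*|192.168.*)                        return 0 ;;   # 10/8, 192.168/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;   # 172.16/12
    *)                                     return 1 ;;
  esac
}

# Example (run from inside the VNet):
# is_private_ip "$(resolved_ip sql-prod.database.windows.net)" \
#   && echo "Private Endpoint working" || echo "still resolving to public IP"
```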
🔐 Identity: Stop Using Passwords. Like, Yesterday.
This is 2026. If your applications are still connecting to Azure resources with connection strings that have passwords in them, we need to have a serious conversation.
The Identity Hierarchy
🏆 Tier 1: Managed Identity (BEST — no credentials at all)
App → Azure Resource, zero secrets involved
🥈 Tier 2: Workload Identity Federation (K8s pods → Azure)
Pod → Federated Token → Azure Resource
🥉 Tier 3: OIDC Federation (CI/CD → Azure)
Pipeline → Short-lived token → Azure Resource
💀 Tier Last: Service Principal + Client Secret
"We rotated the secret and broke prod at 4 AM"
🚨 Real-World Disaster #5: The Expired Service Principal
The 3 AM PagerDuty Alert:
CRITICAL: Deployment pipeline failed
Error: AADSTS7000222: The provided client secret keys for app
'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' are expired.
What Happened: A service principal secret was set to expire in 6 months. Nobody set up a reminder. 6 months passed. Production deployment pipeline stopped working. Release blocked for 4 hours while someone figured out how to rotate the secret without breaking other services using it.
The Fix: Stop using client secrets entirely.
# For pipelines: Use OIDC federation (no secrets!)
az ad app federated-credential create \
--id <app-object-id> \
--parameters '{
"name": "github-main-branch",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:myorg/myrepo:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}'
# For Azure resources: Use Managed Identity
az webapp identity assign --name myapp --resource-group rg-prod
🧮 Choosing Your Compute Platform
Every week someone asks: "Should we use AKS or App Service?" Here's the cheat sheet:
| Need | Use This | Why |
|---|---|---|
| "We have microservices and K8s expertise" | AKS | Full control, service mesh, custom operators |
| "Simple web app, REST API" | App Service | Managed, easy, cost-effective |
| "Containers but no K8s pls" | Container Apps | Serverless containers, KEDA built-in |
| "Event-driven, sporadic traffic" | Azure Functions | Scale-to-zero, pay-per-execution |
| "We need GPUs" | AKS (GPU node pools) | Only K8s gives you GPU scheduling flexibility |
| "Legacy .NET app" | App Service | Or containerize it for Container Apps |
🚨 Real-World Disaster #6: The Over-Engineered Startup
The Situation: A 4-person startup with one API and one frontend deployed to a 3-node AKS cluster with Istio service mesh, Prometheus, Grafana, Kyverno, and ArgoCD. Monthly cloud bill: $2,800. Total users: 47.
The Fix: Migrated to Azure Container Apps. Monthly bill: $12.
💡 Principal Insight: The right tool depends on your actual needs, not your resume aspirations. AKS is the right call when you have the scale and team to justify it. For everything else, there are simpler options.
💰 FinOps: Because Money Is a Feature
Cloud cost isn't someone else's problem. At the Principal level, cost optimization is part of your architecture decisions.
Quick Wins
| Action | Typical Savings |
|---|---|
| Right-size VMs (Azure Advisor recommendations) | 20-40% |
| Reserved Instances (1-3 year commit) | 30-72% |
| Spot VMs for batch/test workloads | 60-90% |
| Auto-shutdown for dev/test | 40-60% |
| Storage lifecycle policies (hot → cool → archive) | 50-80% on storage |
| Delete orphaned disks, IPs, load balancers | Immediate savings |
The FinOps Command You Should Run Right Now
# Find orphaned resources (unattached disks, unused public IPs)
az disk list --query "[?managedBy==null].{Name:name, Size:diskSizeGb, RG:resourceGroup}" -o table
az network public-ip list --query "[?ipConfiguration==null].{Name:name, RG:resourceGroup}" -o table
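To turn that query into a single number for the finance ticket, pipe the disk sizes through awk. A sketch — the `printf` sample stands in for the real az output:

```shell
# In real use, replace the printf with:
#   az disk list --query "[?managedBy==null].diskSizeGb" -o tsv
printf '128\n256\n512\n' | awk '{total += $1} END {print total " GB of orphaned disk"}'
# → 896 GB of orphaned disk
```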
I guarantee you'll find at least 3 orphaned disks you're paying for right now. Go check. I'll wait. ☕
🎯 Key Takeaways
- ARM throttling is real — design for multi-subscription from the start
- Management groups + Landing Zones = the foundation of enterprise Azure
- Tag everything or drown in mystery costs
- Private Endpoints everywhere — no public backends, no exceptions
- Managed Identity > Workload Identity > OIDC > ... > secrets (secrets are the worst)
- Pick the right compute — don't bring AKS to a Container Apps fight
- FinOps is architecture — cost is a first-class design requirement
🔥 Homework
- Run the orphaned disk command above. Screenshot the results (I dare you to have zero).
- Check if ANY of your production SQL databases have public network access. Fix them.
- Find one service principal with an expired or expiring secret. Replace it with Managed Identity or OIDC.
Next up in the series: *Kubernetes: The Drama of Pods, Nodes, and the Scheduler Who Hates Everyone* — where we decode K8s internals, real production meltdowns, and why your pod keeps getting OOMKilled at 2 AM.
💬 Drop a comment if you've survived any of these disasters. Bonus points if your war story is worse. (I know it is.)