Reflecting on two decades of leading technology transformations. Global FinTech SaaS platforms processing millions of transactions. AI-driven automation that reclaimed hundreds of person-days. High-pressure private equity exits. Kubernetes clusters that bridged the gap from legacy to cloud-native. Terraform state files that moved the needle for global consultancies.
I’ve seen the "cloud-native" definition shift multiple times. This is the post I wish someone had written for me when I started my journey through the ranks. moving through roles as a DevOps Engineer, an Engineering Lead and now Senior DevOps Manager, I’ve navigated the Azure ecosystem's growth from a "Windows-first" cloud to a powerhouse of Linux, Kubernetes, and AI.
Every decision below is something I’ve either shipped to production and would do again, or something I’ve spent months deconstructing to pay down technical debt.
If I were starting a greenfield project today, or stepping into a new role, here is how I would weigh the decisions that actually move the needle on business outcomes.
No theoretical fence-sitting. No playing it safe with vendor-neutrality. This is the raw reality of scaling Azure under pressure.
Azure
🟩 Endorse
Picking Azure over AWS
The Azure Advantages (Why I’d do it again):
The Container Gold Mine: AWS AppRunner is a non-starter. If you want true serverless containers that actually scale to zero, Azure Container Apps (ACA) is the industry leader. Similarly, AKS feels like a cohesive product, whereas EKS often feels like a collection of parts you have to bolt together yourself.
Identity that Works: Microsoft Entra ID (formerly Azure AD) is the undisputed heavyweight champion. Compared to the headache of AWS Cognito, Entra ID is a robust, enterprise-grade solution that I’ve relied on to secure highly regulated, global platforms.
The Developer Ecosystem: Between GitHub Actions and Azure DevOps, the CI/CD story is just tighter. When you’re pushing for on-time delivery, having your repos, boards, and pipelines in one ecosystem is a massive force multiplier.
The AWS Envy (The "Regrets" and Hurdles):
Networking Flexibility: I’ll be honest, Route 53 is easier to wield than Azure DNS. AWS handles global routing and health checks with a level of simplicity that Azure's Front Door and Traffic Manager can sometimes overcomplicate.
Tooling & Local Dev: Azure still lacks a true "LocalStack" equivalent. Being able to emulate the entire cloud locally is a huge win for AWS developers. In the Azure world, we rely more on emulators like Azurite, but it’s not quite the same "cloud-in-a-box" experience.
Deployment Velocity: There’s no sugar-coating it, Azure Resource Manager (ARM) and even some Bicep deployments can be painfully slow. Watching a gateway or a firewall provision for 20 minutes is a "coffee break" I’d rather not have to take.
The Talent War: It’s still a reality that finding AWS-specialised engineers is easier on the open market. Building the high-performing Azure teams I have led in the past required more internal "up-skilling" and intentional training programs because the talent pool is shallower.
Security Group Fatigue: While Azure’s NSGs and ASGs are logical, tracing a routing issue through multiple VNet peers and UDRs (User Defined Routes) can become "complicated AF" compared to the more fluid networking model AWS sometimes offers.
The Bottom Line:
I’ve shipped on both. AWS might have the "easy" button for networking and local dev, but for enterprise-scale automation, serverless maturity, and AI-driven efficiency, I’m putting my money on the Azure stack every single time.
Azure Container Apps (ACA) vs AKS
🟩 What I Endorse: Use Azure Container Apps (ACA) for the majority of microservices. In previous years, we went straight to AKS (Azure Kubernetes Service). While I’ve successfully utilised Kubernetes to optimize cloud operations for global organisations, the cognitive load on the team is significant. ACA provides the "Serverless Container" experience that allows teams to scale to zero and focus on the code, not the control plane. Having managed 24/7 operational environments, I’ve learned that the best infrastructure is the one you don't have to babysit. ACA gives you the "Serverless Container" benefit without the "Kubernetes Tax."
🟧 What I regret: Using AKS (Azure Kubernetes Service) for small, product-focused teams. I regret the times I built a full K8s orchestration layer for a simple API. Unless you have the resilient team structures in place to manage the complexity of ingress controllers, service meshes, and node pools, it’s an unnecessary tax on productivity.
Azure SQL vs Cosmos DB
🟩 What I Endorse: Default to Azure SQL (Hyperscale or Managed Instance) for 80% of enterprise workloads. Relational data is the backbone of sectors like Insurance and Finance. I’ve found that the reliability and familiar tooling of SQL Server, scaled via Azure’s elastic tiers, solve most business problems without the complexity of distributed NoSQL. Azure SQL is battle-tested. It scales, it’s familiar, and it’s consistently reliable for the heavy lifting I've overseen.
🟧 What I regret: Using Cosmos DB as a "catch-all" database.
Cosmos DB is incredible for global scale and low-latency NoSQL needs. However, I regret using it in scenarios where the data schema was actually quite rigid. If you don't need global distribution or a flexible schema, Cosmos DB is an expensive way to realize you should have just used a JOIN. Cosmos DB is an engineering marvel, but the cost of mismanagement (improper partitioning) is high. I now reserve Cosmos for specific, high-scale event stores rather than general-purpose storage.
Terraform vs Bicep
🟩 What I Endorse: Stick with Terraform for hybrid and multi-cloud environments. My experience reflects a heavy reliance on Terraform and ARM templates because consistency is king. Its ability to manage state and provide a platform-agnostic language is vital when coordinating hybrid cloud infrastructure. In lots of previous roles I championed the migration of applications to Azure; Terraform provided the common language that allowed us to bridge the gap between legacy on-prem and cloud enablement.
🟧 What I regret: Waiting too long to adopt Azure Bicep for Azure-only projects. If you are 100% in the Azure ecosystem, the "Day 0" support for new features in Bicep is superior to waiting for provider updates in Terraform. When I first started working with the cloud it was in AWS, and the regret was often not using CDK. In Azure, the regret is sticking with verbose, nested ARM Templates for too long. For Azure-native shops, Bicep offers a much cleaner abstraction. While I still champion Terraform for the "big picture," I regret not moving our Azure-specific modules to Bicep sooner to lower the cognitive load for the team. However, for the strategic, global portfolios I manage now, Terraform remains the gold standard for state management and modularity.
Azure Functions vs Logic Apps
🟩 What I Endorse: A "Functions-first" approach for event-driven logic. Utilising C# and Python Functions has allowed my teams to focus on product development rather than infrastructure patching. It’s the ultimate tool for reducing technical debt while maintaining a high velocity. Functions Triggers have "reactivity" baked into the DNA. You aren't writing "glue code" just to move an event to a HTTP call, the integration with Service Bus and Blob Storage is seamless and saves weeks of development time.
🟧 What I regret: Building complex business logic directly into Logic Apps. While Logic Apps are great for "no-code" glue between SaaS products, they are a nightmare for version control, unit testing, and complex CI/CD pipelines. I regret not moving that logic into Azure Functions earlier, where we could apply proper software engineering rigor.
Application Insights & Log Analytics vs Tool Sprawl
🟩 What I Endorse: Investing heavily in Application Insights from day zero. You cannot manage a 24/7 operational environment without a unified view. Also, shifting the perception of IT within an organisation requires data. By leveraging Azure Monitor and App Insights, I’ve been able to show stakeholders real-time dashboards that link system health to business value.
🟧 What I regret: Relying on fragmented third-party monitoring tools.
In high-pressure 24/7 environments, "tool sprawl" is the enemy. I regret the times we had separate logs for the app, the network, and the database that didn't talk to each other. Consolidating into a unified Log Analytics workspace is nearly always the better move.
Azure Kubernetes Service Node Auto-Provisioning (NAP)
🟩 Endorse
If you’re running AKS without Azure Kubernetes Service Node Auto-Provisioning (NAP), which manages the open source project Karpenter, you’re essentially burning money.
In the old world, the Cluster Autoscaler was the bottleneck: slow, rigid, and constantly fighting with manual Node Pool configurations. In the modern Azure ecosystem, Node Auto-Provisioning (based on the Karpenter model) is the game-changer. It’s fast, intelligent, and provisions the exact VM sizes your pods actually need in real-time.
Drawing from my experience optimising cloud operations for global SaaS platforms, we’ve seen 30-40% cost reductions on compute after moving away from static node pools.
Spot Instance Handling: It actually works without the "eviction anxiety" of the past.
True Consolidation: It proactively scales down and moves workloads to ensure you aren't paying for "ghost" capacity.
Real Bin-Packing: No more half-empty D-Series VMs sitting idle; the scheduler finally treats your compute like a fluid resource.
The learning curve: Getting your head around NodePool objects, Provisioning configuration, and taints/tolerations, is real. But for any strategic portfolio in 2026, this isn't an "add-on", it’s a non-negotiable requirement for a resilient, cost-efficient infrastructure.
KEDA (Kubernetes Event-Driven Autoscaling)
🟩 Endorse for Azure event-driven workloads
In my time directing technology transformations, I’ve seen too many teams rely on standard HPA (Horizontal Pod Autoscaler) scaling on CPU/Memory. That’s a blunt instrument. KEDA is the precision tool. While HPA watches your hardware, KEDA watches your business logic: Azure Service Bus queue depth, Event Hub lag, or Storage Queue message counts.
If you’re running workers that process insurance claims or property management updates, KEDA is the answer. We’ve used it to cut costs significantly on batch processing workloads that used to sit idle 24/7 "just in case." Now, they scale to zero when the queue is empty and ramp up instantly based on the actual backlog.
Where it shines in the Azure Stack:
Azure Service Bus / Storage Queues: Scaling based on activeMessageCount. This is the "gold standard" for async processing.
Azure Event Hubs: Scaling based on partition lag to ensure you aren't falling behind on high-throughput data streams.
Scheduled Jobs: Using cron-based scaling for predictable peak periods (e.g., end-of-month reporting), which often outperforms standard Kubernetes CronJobs for long-running processes.
Azure Monitor/Log Analytics: Scaling on custom KQL (Kusto) metrics that your application exposes.
Where it doesn’t:
Request-based scaling: If you're scaling a public-facing API based on traffic, stick with HPA + Azure Application Gateway/Ingress metrics.
Latency-sensitive "Instant-on" workloads: If your application cannot handle a cold start from zero, keep a minimum replica count.
The Strategy: Use KEDA (ScaledObjects) for your async workers and background processors, and HPA for your synchronous REST APIs. They coexist perfectly in the same AKS cluster, one handles the reactive "pull" work, the other handles the proactive "push" traffic.
The Actual Lessons:
Two decades in the trenches have distilled my architectural "religion" down to these hard-won truths. If you’re building on Azure in 2026, this is the blueprint.
Non-negotiable:
AKS + Node Auto-Provisioning (NAP): If you aren't using the Karpenter-based model, you’re burning money on static node pools.
Azure Container Apps (ACA) for Microservices: Stop paying the "Kubernetes Tax" for simple APIs. Focus on the code, not the control plane.
Entra ID (Azure AD): The undisputed heavyweight champion of identity. If you aren't using it to secure your global platforms, you're creating a headache for your future self. Use it for everything. OIDC federation is the only way to kill long-lived CI credentials.
Terraform for the Big Picture: Its ability to manage state and provide a platform-agnostic language is vital for coordinating hybrid cloud infrastructure.
KEDA for Async Workers: If it’s an event-driven consumer, scale on queue depth, not CPU. Precision scaling based on Azure Service Bus depth or Event Hub lag. Scale to zero or fail your budget.
Application Insights from Day Zero: End-to-end observability is the only way to manage 24/7 operational environments and shift the perception of IT within an organisation.
Avoid at all costs:
Logic Apps for Complex Logic: It’s a CI/CD and unit-testing nightmare. Keep the "glue" in Azure Functions.
Cosmos DB as a "Catch-all": It’s an expensive way to store relational data. Unless you truly need global sub-10ms latency, stick to Azure SQL. Reserve it for global-scale NoSQL, not rigid schemas.
AKS for Small Teams: Don't build a full K8s orchestration layer for a simple API unless you have the platform team to babysit it.
Tool Sprawl: Fragmented third-party monitoring is the enemy of incident response. Stop paying for three different monitoring tools that don't talk to each other. Consolidate into a unified Log Analytics workspace.
Verbose ARM Templates: Don't stay in the "nested ARM" world for Azure-native projects; the cognitive load isn't worth it.
Niche wins worth knowing:
Innersourcing: The most effective way to improve culture and reduce technical debt across global silos.
The Meta-Lessons:
Ship Business Outcomes, Not Infrastructure: My most successful projects weren't because of a "perfect" cluster; they were because the infrastructure allowed the product to scale without the team burning out.
Boring Technology is Resilient Technology: The C# Azure Function that’s been running for three years without a restart is worth more than the experimental service mesh that requires a weekly post-mortem.
Automation is Your Legacy: In a high-pressure environment automation isn't a luxury, it’s your only defense against chaos.
Business Observability: If your logs and metrics don't link system health to business value in 15 minutes, your architecture is too clever for its own good.
Connect on LinkedIn
Top comments (0)