<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devang Goyal</title>
    <description>The latest articles on DEV Community by Devang Goyal (@clouddevang).</description>
    <link>https://dev.to/clouddevang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3934636%2Fa1a6768b-b33f-4a80-a6ee-49423ee429a5.png</url>
      <title>DEV Community: Devang Goyal</title>
      <link>https://dev.to/clouddevang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clouddevang"/>
    <language>en</language>
    <item>
      <title>Building Zero-Trust Infrastructure on Azure: A Production Story</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:44:55 +0000</pubDate>
      <link>https://dev.to/clouddevang/building-zero-trust-infrastructure-on-azure-a-production-story-1dee</link>
      <guid>https://dev.to/clouddevang/building-zero-trust-infrastructure-on-azure-a-production-story-1dee</guid>
      <description>&lt;p&gt;When I joined the platform team at a financial services company, I inherited an infrastructure that, while functional, had significant security gaps. APIs were exposed to the public internet, database connections traversed public networks, and secret management relied on application configuration files. This is the story of how we transformed that architecture into a true zero-trust environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Trust Boundaries Were Too Wide
&lt;/h2&gt;

&lt;p&gt;Our initial architecture followed a common anti-pattern: everything inside the "corporate network" was trusted. Azure App Services communicated with Azure SQL over public endpoints. Key Vault secrets were fetched using connection strings stored in app settings. Storage accounts accepted requests from any IP address.&lt;/p&gt;

&lt;p&gt;The reality of modern cloud architecture is that &lt;strong&gt;there is no perimeter&lt;/strong&gt;. Zero-trust treats every request, whether internal or external, as untrusted until it is authenticated and authorized. Our infrastructure violated this principle at multiple levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Redesign
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VNet Integration for All Compute
&lt;/h3&gt;

&lt;p&gt;The first major change was enabling VNet integration for every compute resource. Azure App Services, Azure Functions, and Azure Container Apps were all connected to a dedicated virtual network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VNet Architecture:
├── Management Subnet (10.0.1.0/24)
│   └── Jumpbox, Bastion
├── App Subnet (10.0.2.0/24)
│   └── App Services, Functions
├── Container Subnet (10.0.3.0/24)
│   └── Container Apps
└── Data Subnet (10.0.4.0/24)
    └── Private Endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With VNet integration, outbound traffic from our applications now routes through the virtual network, allowing us to control egress through Network Security Groups and route tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Private Endpoints for Data Services
&lt;/h3&gt;

&lt;p&gt;The most critical change was eliminating public endpoints for all data services. Azure SQL, Key Vault, Storage Accounts, and Service Bus were all configured with private endpoints.&lt;/p&gt;

&lt;p&gt;Private endpoints create a network interface inside your VNet with a private IP address. When your application connects to &lt;code&gt;yourdb.database.windows.net&lt;/code&gt;, DNS resolution returns the private IP (e.g., &lt;code&gt;10.0.4.10&lt;/code&gt;) instead of the public IP.&lt;/p&gt;

&lt;p&gt;This required careful DNS configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private DNS Zones&lt;/strong&gt;: We created private DNS zones for each service type (&lt;code&gt;privatelink.database.windows.net&lt;/code&gt;, &lt;code&gt;privatelink.vaultcore.azure.net&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VNet Links&lt;/strong&gt;: Each private DNS zone was linked to our VNet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record Management&lt;/strong&gt;: Private endpoints automatically register A records in these zones&lt;/li&gt;
&lt;/ul&gt;
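
&lt;p&gt;A quick way to verify this behaviour is to resolve the hostname from inside the VNet and again from outside it. A minimal C# sketch, assuming a small .NET console app and the placeholder hostname used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Net;

// Inside the VNet this should print the private endpoint IP (e.g. 10.0.4.10);
// from outside it prints the public IP, which the service firewall rejects anyway.
var addresses = await Dns.GetHostAddressesAsync("yourdb.database.windows.net");

foreach (var address in addresses)
{
    Console.WriteLine(address);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;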

&lt;p&gt;The result: &lt;strong&gt;zero public database exposure&lt;/strong&gt;. Even if an attacker compromised our application, they couldn't exfiltrate data over the internet because our SQL Server doesn't have a public IP.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. RBAC-Enforced Key Vault Access
&lt;/h3&gt;

&lt;p&gt;Instead of connection strings, we moved to managed identity authentication with RBAC. Each application is given a system-assigned managed identity, and Key Vault access is granted through role assignments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Old approach - connection string&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SecretClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vaultUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// New approach - same code, but identity is VNet-integrated&lt;/span&gt;
&lt;span class="c1"&gt;// and Key Vault only accepts requests from our VNet&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SecretClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vaultUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code didn't change, but the security posture did. Key Vault now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rejects requests from public internet&lt;/li&gt;
&lt;li&gt;Only accepts requests from our VNet via private endpoint&lt;/li&gt;
&lt;li&gt;Requires managed identity authentication (no secrets to manage)&lt;/li&gt;
&lt;li&gt;Enforces RBAC permissions (least-privilege access)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Service Endpoints for Azure SQL
&lt;/h3&gt;

&lt;p&gt;While private endpoints are ideal for most scenarios, we also used service endpoints for Azure SQL to provide defense in depth. Service endpoints route traffic through Azure's backbone network while allowing firewall rules at the SQL Server level.&lt;/p&gt;

&lt;p&gt;Our SQL Server firewall configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public network access&lt;/strong&gt;: Disabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual network rules&lt;/strong&gt;: Allow traffic from app subnet only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private endpoint&lt;/strong&gt;: Primary access method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means even if someone obtained valid credentials, they couldn't connect from outside our VNet.&lt;/p&gt;
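
&lt;p&gt;On the application side, the database connection itself also became passwordless. A minimal sketch with &lt;code&gt;Microsoft.Data.SqlClient&lt;/code&gt; (3.0 or later); the server and database names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Microsoft.Data.SqlClient;

// "Active Directory Default" acquires a token through the same credential chain
// as DefaultAzureCredential, so no password ever appears in configuration.
var connectionString =
    "Server=yourdb.database.windows.net;" +
    "Database=payments;" +
    "Authentication=Active Directory Default;" +
    "Encrypt=True;";

using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();
Console.WriteLine($"Connected to {connection.Database}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;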

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DNS is Everything
&lt;/h3&gt;

&lt;p&gt;The most challenging aspect wasn't the security configuration—it was DNS. When you enable private endpoints, you need to ensure that DNS resolution works correctly both from within Azure and from developer workstations.&lt;/p&gt;

&lt;p&gt;We implemented split-brain DNS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inside the VNet: private DNS zones return private endpoint IPs&lt;/li&gt;
&lt;li&gt;Outside the VNet: public DNS returns the public IP, and the service firewall rejects those connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For local development, developers connect via VPN, so their DNS queries are answered from inside the VNet and resolve to the private endpoint IPs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Identity Adoption Takes Time
&lt;/h3&gt;

&lt;p&gt;Moving from connection strings to managed identity required updating every application. Some third-party libraries didn't support managed identity initially, requiring workarounds or upgrades.&lt;/p&gt;

&lt;p&gt;The key was implementing changes incrementally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enable managed identity on the resource&lt;/li&gt;
&lt;li&gt;Grant RBAC permissions&lt;/li&gt;
&lt;li&gt;Update application code to use &lt;code&gt;DefaultAzureCredential&lt;/code&gt; (see the sketch below)
&lt;/li&gt;
&lt;li&gt;Remove the old connection string&lt;/li&gt;
&lt;li&gt;Verify with monitoring&lt;/li&gt;
&lt;/ol&gt;
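
&lt;p&gt;For step 3, the change in a .NET service is usually a one-line swap to &lt;code&gt;DefaultAzureCredential&lt;/code&gt;. A minimal sketch; the vault URL and secret name are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// DefaultAzureCredential tries environment variables, managed identity and
// developer sign-in in turn, so the same code runs locally and in Azure.
var client = new SecretClient(
    new Uri("https://your-vault.vault.azure.net/"),
    new DefaultAzureCredential());

KeyVaultSecret secret = await client.GetSecretAsync("Sql-AdminPassword");
Console.WriteLine($"Fetched secret version {secret.Properties.Version}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;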

&lt;h3&gt;
  
  
  Cost Considerations
&lt;/h3&gt;

&lt;p&gt;Private endpoints aren't free. Each private endpoint incurs a small hourly cost plus data processing charges. For a large deployment with many endpoints, this adds up.&lt;/p&gt;

&lt;p&gt;We optimized costs by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidating storage accounts where possible&lt;/li&gt;
&lt;li&gt;Using service endpoints as a complement (free)&lt;/li&gt;
&lt;li&gt;Implementing shared private endpoints for multi-region deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After implementing zero-trust architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero public database exposure&lt;/strong&gt;: All data services are private endpoint only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50% reduction in attack surface&lt;/strong&gt;: No public IPs on backend infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified secret management&lt;/strong&gt;: Managed identity eliminated most secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved compliance posture&lt;/strong&gt;: SOC 2 and PCI DSS audits became straightforward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important outcome wasn't technical—it was cultural. The team now defaults to private, authenticated, authorized communication for every new service. Zero-trust isn't a destination; it's a way of building systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building zero-trust infrastructure on Azure requires careful planning, especially around networking and DNS. But the security benefits are substantial. By eliminating implicit trust and enforcing authentication at every boundary, we've created an architecture that's resilient to both external attacks and internal compromise.&lt;/p&gt;

&lt;p&gt;If you're starting a similar journey, begin with VNet integration. Once your compute resources are in a VNet, private endpoints and RBAC become natural extensions. And remember: zero-trust is a principle, not a product. Every architecture decision should ask, "What happens if this is compromised?"&lt;/p&gt;

</description>
      <category>azure</category>
      <category>security</category>
      <category>sre</category>
    </item>
    <item>
      <title>SLOs, SLIs, and Error Budgets: A Practical Guide for SREs</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:15:45 +0000</pubDate>
      <link>https://dev.to/clouddevang/slos-slis-and-error-budgets-a-practical-guide-for-sres-5bmc</link>
      <guid>https://dev.to/clouddevang/slos-slis-and-error-budgets-a-practical-guide-for-sres-5bmc</guid>
      <description>&lt;p&gt;Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets form the foundation of Site Reliability Engineering. Yet many teams struggle to implement them effectively. This guide shares practical lessons from implementing SLO-based reliability practices in production financial systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the SRE Reliability Stack
&lt;/h2&gt;

&lt;p&gt;Before diving into implementation, let's clarify the hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLI (Service Level Indicator)&lt;/strong&gt;: A quantitative measure of service behavior (e.g., "99.2% of requests completed in under 200ms")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO (Service Level Objective)&lt;/strong&gt;: The target value for an SLI (e.g., "99.9% of requests should complete in under 200ms")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLA (Service Level Agreement)&lt;/strong&gt;: A contract with consequences for missing SLOs (e.g., "If we miss 99.9%, customers get credits")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Budget&lt;/strong&gt;: The allowed failure rate (e.g., "0.1% of requests can fail per month")&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing the Right SLIs
&lt;/h2&gt;

&lt;p&gt;The most common mistake teams make is tracking too many SLIs. Start with these four golden signals:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Availability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;availability = successful_requests / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an API, this might be: "Percentage of HTTP requests returning 2xx or expected 4xx status codes."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;latency_sli = requests_under_threshold / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track at multiple percentiles: p50 for typical experience, p99 for tail latency. For financial systems, we use p99.9.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Throughput
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;throughput = successful_requests_per_second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Critical for batch processing systems and data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Error Rate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error_rate = failed_requests / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Distinguish between client errors (4xx) and server errors (5xx)—only count 5xx against your error budget.&lt;/p&gt;
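
&lt;p&gt;To make the distinction concrete, here is a small C# sketch that computes the availability and latency SLIs from a batch of request records, counting only 5xx as bad; the record type and threshold are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.Linq;

public record RequestRecord(int StatusCode, double DurationMs);

public static class SliCalculator
{
    // Availability: only server errors (5xx) count against the budget.
    public static double Availability(IReadOnlyList&amp;lt;RequestRecord&amp;gt; requests) =&amp;gt;
        requests.Count(r =&amp;gt; r.StatusCode &amp;lt; 500) / (double)requests.Count;

    // Latency SLI: fraction of requests finishing under the threshold.
    public static double LatencySli(IReadOnlyList&amp;lt;RequestRecord&amp;gt; requests, double thresholdMs) =&amp;gt;
        requests.Count(r =&amp;gt; r.DurationMs &amp;lt;= thresholdMs) / (double)requests.Count;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;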

&lt;h2&gt;
  
  
  Setting Realistic SLOs
&lt;/h2&gt;

&lt;p&gt;Here's a framework I use for setting SLOs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Measure Current Performance
&lt;/h3&gt;

&lt;p&gt;Don't guess. Run your system for 2-4 weeks and measure actual performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example query for availability over 30 days&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;request_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Understand User Expectations
&lt;/h3&gt;

&lt;p&gt;Interview stakeholders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What latency do users notice?&lt;/li&gt;
&lt;li&gt;How much downtime is acceptable?&lt;/li&gt;
&lt;li&gt;What's the business impact of degradation?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Set Achievable Targets
&lt;/h3&gt;

&lt;p&gt;If your current availability is 99.5%, don't set an SLO of 99.99%. Start with 99.7% and improve incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Your SLO should be slightly below your actual performance. This gives you room to experiment and deploy without constant alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Error Budgets
&lt;/h2&gt;

&lt;p&gt;Error budgets are the game-changer. They answer: "How much unreliability can we tolerate?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Calculating Error Budget
&lt;/h3&gt;

&lt;p&gt;For a 99.9% availability SLO over 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error Budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of downtime allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
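
&lt;p&gt;The same arithmetic generalises to any SLO and window; a tiny C# helper, for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Over a 30-day window:
//   0.999  allows 43.2 minutes of downtime
//   0.9995 allows 21.6 minutes
//   0.9999 allows  4.3 minutes
Console.WriteLine(ErrorBudget.AllowedDowntimeMinutes(slo: 0.999, windowDays: 30));

public static class ErrorBudget
{
    // Allowed downtime in minutes for a given availability SLO over a window.
    public static double AllowedDowntimeMinutes(double slo, int windowDays) =&amp;gt;
        (1 - slo) * windowDays * 24 * 60;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;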



&lt;h3&gt;
  
  
  Error Budget Policy
&lt;/h3&gt;

&lt;p&gt;Here's the policy we implemented:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Budget Remaining&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 50%&lt;/td&gt;
&lt;td&gt;Normal development velocity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-50%&lt;/td&gt;
&lt;td&gt;Increased review rigor, limit risky changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-25%&lt;/td&gt;
&lt;td&gt;Feature freeze, focus on reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 10%&lt;/td&gt;
&lt;td&gt;All hands on reliability, no new features&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
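
&lt;p&gt;We also encode the policy in code so the expected action is unambiguous mid-incident. A simplified C# sketch of that mapping, with thresholds mirroring the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public static class ErrorBudgetPolicy
{
    // budgetRemaining is the fraction of the monthly budget still unspent (0.0-1.0).
    public static string ActionFor(double budgetRemaining) =&amp;gt; budgetRemaining switch
    {
        &amp;gt; 0.50 =&amp;gt; "Normal development velocity",
        &amp;gt; 0.25 =&amp;gt; "Increased review rigor, limit risky changes",
        &amp;gt; 0.10 =&amp;gt; "Feature freeze, focus on reliability",
        _ =&amp;gt; "All hands on reliability, no new features",
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;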

&lt;h3&gt;
  
  
  Burn Rate Alerts
&lt;/h3&gt;

&lt;p&gt;Instead of alerting on instantaneous errors, alert on burn rate—how fast you're consuming your error budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alert for fast burn rate&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorBudgetBurn&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[1h]))&lt;/span&gt;
      &lt;span class="s"&gt;/ sum(rate(http_requests_total[1h]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;gt; (14.4 * 0.001)  # 14.4x burn rate = budget exhausted in 5 days&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
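
&lt;p&gt;The 14.4 multiplier falls out of the budget math: burn rate is the observed error rate divided by the budgeted rate, and window length divided by burn rate gives time to exhaustion. A small C# sketch of that arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// A 1.44% error rate against a 99.9% SLO is a 14.4x burn:
// 30 days / 14.4 = roughly 2.1 days until the monthly budget is gone.
Console.WriteLine(BurnRate.DaysToExhaustion(BurnRate.Of(0.0144, slo: 0.999), windowDays: 30));

public static class BurnRate
{
    // How many times faster than budgeted we are consuming errors.
    public static double Of(double observedErrorRate, double slo) =&amp;gt;
        observedErrorRate / (1 - slo);

    // Days until the budget is spent if the current burn rate continues.
    public static double DaysToExhaustion(double burnRate, int windowDays) =&amp;gt;
        windowDays / burnRate;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;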



&lt;h2&gt;
  
  
  Real-World Implementation: A Case Study
&lt;/h2&gt;

&lt;p&gt;At BitFlyer, we implemented SLOs for our trading API:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No formal SLOs&lt;/li&gt;
&lt;li&gt;Alerts on arbitrary thresholds&lt;/li&gt;
&lt;li&gt;Constant alert fatigue&lt;/li&gt;
&lt;li&gt;No clear prioritization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Week 1-2: Instrumentation&lt;/strong&gt;&lt;br&gt;
We added OpenTelemetry instrumentation to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request duration histograms&lt;/li&gt;
&lt;li&gt;Status code counters&lt;/li&gt;
&lt;li&gt;Dependency latencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 3-4: Baseline Measurement&lt;/strong&gt;&lt;br&gt;
Measured actual performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability: 99.89%&lt;/li&gt;
&lt;li&gt;P99 latency: 180ms&lt;/li&gt;
&lt;li&gt;Error rate: 0.08%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 5-6: SLO Definition&lt;/strong&gt;&lt;br&gt;
Set initial SLOs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability SLO: 99.9% (gives 43 min/month budget)&lt;/li&gt;
&lt;li&gt;Latency SLO: 99% of requests &amp;lt; 200ms&lt;/li&gt;
&lt;li&gt;Error rate SLO: &amp;lt; 0.1% server errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 7-8: Alerting Migration&lt;/strong&gt;&lt;br&gt;
Replaced 47 arbitrary alerts with 6 SLO-based alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 availability burn rate alerts (fast/slow)&lt;/li&gt;
&lt;li&gt;2 latency burn rate alerts (fast/slow)&lt;/li&gt;
&lt;li&gt;2 error rate burn rate alerts (fast/slow)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results After 3 Months
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Alert volume reduced by 73%&lt;/li&gt;
&lt;li&gt;MTTR improved by 45%&lt;/li&gt;
&lt;li&gt;Engineering velocity increased (fewer interruptions)&lt;/li&gt;
&lt;li&gt;Clear prioritization framework for incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SLO Perfection Syndrome
&lt;/h3&gt;

&lt;p&gt;Don't aim for 100% availability. It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practically impossible to achieve&lt;/li&gt;
&lt;li&gt;Prohibitively expensive to approach&lt;/li&gt;
&lt;li&gt;A brake on innovation: a zero error budget leaves no room to ship changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between 99.9% and 99.99% is a 10x cost increase for most systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Too Many SLOs
&lt;/h3&gt;

&lt;p&gt;Start with 3-5 SLOs per service. More creates confusion and alert fatigue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ignoring Dependencies
&lt;/h3&gt;

&lt;p&gt;Your service's SLO is bounded by your dependencies' SLOs. If your database has 99.9% availability, you cannot achieve 99.99% for your API.&lt;/p&gt;
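
&lt;p&gt;A quick way to see the ceiling: availabilities of serial dependencies multiply. A short C# illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Linq;

// An API that needs both its database (99.9%) and an auth service (99.95%)
// can offer at most 0.999 * 0.9995 = 99.85% availability, before its own failures.
double[] dependencyAvailabilities = { 0.999, 0.9995 };
double ceiling = dependencyAvailabilities.Aggregate(1.0, (acc, a) =&amp;gt; acc * a);
Console.WriteLine(ceiling.ToString("P3"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;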

&lt;h3&gt;
  
  
  4. Set and Forget
&lt;/h3&gt;

&lt;p&gt;Review SLOs quarterly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are they still relevant?&lt;/li&gt;
&lt;li&gt;Are they too tight (constant alerts) or too loose (not protecting users)?&lt;/li&gt;
&lt;li&gt;Has the business context changed?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tooling Recommendations
&lt;/h2&gt;

&lt;p&gt;For implementing SLOs, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Collection&lt;/strong&gt;: Prometheus, Datadog, or Azure Monitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Tracking&lt;/strong&gt;: Sloth, Google SLO Generator, or Datadog SLO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Budget Visualization&lt;/strong&gt;: Grafana dashboards, custom Datadog dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt;: PagerDuty, Opsgenie integrated with burn rate alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SLOs, SLIs, and error budgets aren't just metrics—they're a cultural shift toward data-driven reliability decisions. Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instrument your critical paths&lt;/li&gt;
&lt;li&gt;Measure for 2-4 weeks&lt;/li&gt;
&lt;li&gt;Set conservative SLOs&lt;/li&gt;
&lt;li&gt;Implement burn rate alerting&lt;/li&gt;
&lt;li&gt;Create an error budget policy&lt;/li&gt;
&lt;li&gt;Review and iterate quarterly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal isn't perfect reliability—it's appropriate reliability that balances user happiness with engineering velocity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about implementing SLOs? Connect with me on &lt;a href="https://linkedin.com/in/devang20" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or reach out via the contact form.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>observability</category>
      <category>reliability</category>
    </item>
    <item>
      <title>OpenTelemetry in Practice: Vendor-Agnostic Observability at Scale</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:15:43 +0000</pubDate>
      <link>https://dev.to/clouddevang/opentelemetry-in-practice-vendor-agnostic-observability-at-scale-4c4m</link>
      <guid>https://dev.to/clouddevang/opentelemetry-in-practice-vendor-agnostic-observability-at-scale-4c4m</guid>
      <description>&lt;p&gt;When we started redesigning our customer-facing platform, observability was a first-class concern. We had been using a mix of Azure Application Insights, custom logging, and ad-hoc metrics—a common pattern that leads to gaps in visibility and vendor lock-in. This time, we chose OpenTelemetry (OTel) as our observability foundation. Here's what we learned implementing it in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is a CNCF project that provides vendor-neutral APIs, SDKs, and tools for collecting telemetry data. The key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Flexibility&lt;/strong&gt;: Export to any backend (Datadog, Jaeger, Azure Monitor, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified API&lt;/strong&gt;: One SDK for traces, metrics, and logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industry Standard&lt;/strong&gt;: Growing ecosystem of instrumentation libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future-Proof&lt;/strong&gt;: Active community and broad industry adoption&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We chose Datadog as our initial backend, but the real value is flexibility. When costs or features change, we can switch backends without rewriting instrumentation code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars, Unified
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry handles three types of telemetry:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traces
&lt;/h3&gt;

&lt;p&gt;Distributed traces follow a request across service boundaries. Each span represents a unit of work with timing, attributes, and relationships to other spans.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Numerical measurements like request counts, latency percentiles, and business metrics. OTel supports counters, gauges, and histograms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;

&lt;p&gt;Structured log records with context. OTel logs include trace context, enabling correlation between logs and traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Architecture
&lt;/h2&gt;

&lt;p&gt;Our architecture uses the OTel Collector as a central aggregation point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Application] → [OTel SDK] → [OTel Collector] → [Datadog]
                                      ↓
                               [Azure Monitor] (backup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Collector provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buffering&lt;/strong&gt;: Handles backend unavailability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt;: Sampling, filtering, attribute manipulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-export&lt;/strong&gt;: Send to multiple backends simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SDK Configuration
&lt;/h3&gt;

&lt;p&gt;We use the .NET OpenTelemetry SDK. Here's our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOpenTelemetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ConfigureResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"payment-service"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAttributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetEnvironmentVariable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ASPNETCORE_ENVIRONMENT"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Assembly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetExecutingAssembly&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;GetName&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithTracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracing&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tracing&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAspNetCoreInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddHttpClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSqlClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAspNetCoreInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddHttpClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource attributes&lt;/strong&gt;: Service name, environment, and version tag every signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instrumentation&lt;/strong&gt;: ASP.NET Core, HttpClient, and SQL are instrumented automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom sources&lt;/strong&gt;: Our business logic emits additional spans and metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTLP export&lt;/strong&gt;: The OpenTelemetry Protocol is the native format for the Collector&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Custom Instrumentation
&lt;/h3&gt;

&lt;p&gt;Auto-instrumentation covers HTTP and database calls, but business logic needs manual spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentProcessor&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateCounter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"payments.processed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;ProcessPayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Payment&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;StartActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ProcessPayment"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"payment.amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"payment.currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Business logic&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ValidatePayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ExecutePayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ActivityStatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failure"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A span for each payment with amount and currency attributes&lt;/li&gt;
&lt;li&gt;A counter metric with success/failure dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Structured Logging with Trace Context
&lt;/h3&gt;

&lt;p&gt;OTel logs aren't just text—they're structured records with trace context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogInformation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"Payment {PaymentId} processed for {Amount} {Currency}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OTel logging bridge automatically adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;trace_id&lt;/code&gt;: Links this log to the active trace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;span_id&lt;/code&gt;: Links to the specific span&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;severity&lt;/code&gt;: Derived from the log level&lt;/li&gt;
&lt;li&gt;Structured attributes from the message template&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Datadog, clicking on a log entry shows the full trace that generated it. No correlation IDs to manage manually.&lt;/p&gt;
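
&lt;p&gt;Wiring the bridge is a one-time host change. A minimal sketch for an ASP.NET Core host; the Collector endpoint matches the one used for traces and metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var builder = WebApplication.CreateBuilder(args);

builder.Logging.AddOpenTelemetry(logging =&amp;gt;
{
    // Keep the rendered message and scope values on each exported record.
    logging.IncludeFormattedMessage = true;
    logging.IncludeScopes = true;

    // Ship log records to the same Collector endpoint as traces and metrics.
    logging.AddOtlpExporter(options =&amp;gt;
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
    });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;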

&lt;h2&gt;
  
  
  Collector Configuration
&lt;/h2&gt;

&lt;p&gt;The OTel Collector is the heart of our observability pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="na"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;limit_mib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insert&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DD_API_KEY}&lt;/span&gt;
  &lt;span class="na"&gt;azuremonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;connection_string&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${AZURE_MONITOR_CONNECTION_STRING}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;azuremonitor&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batching&lt;/strong&gt;: Reduces network overhead by sending telemetry in batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory limiting&lt;/strong&gt;: Prevents collector OOM during traffic spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribute injection&lt;/strong&gt;: Adds consistent tags across all telemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-export&lt;/strong&gt;: Primary to Datadog, backup to Azure Monitor&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sampling is Essential
&lt;/h3&gt;

&lt;p&gt;At scale, 100% trace sampling is expensive. We use a combination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head-based sampling&lt;/strong&gt;: 10% of all traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tail-based sampling&lt;/strong&gt;: 100% of error traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority sampling&lt;/strong&gt;: 100% for critical paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Collector's tail sampling processor examines completed traces before deciding to keep them.&lt;/p&gt;
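
&lt;p&gt;The head-based piece lives in the SDK configuration, while the tail-based rules live in the Collector's &lt;code&gt;tail_sampling&lt;/code&gt; processor. A minimal sketch of the 10% head sampler in the same .NET setup shown earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;services.AddOpenTelemetry()
    .WithTracing(tracing =&amp;gt; tracing
        // Head-based: sample 10% of new traces, but always honour the parent's
        // decision so a distributed trace is never half-kept across services.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;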

&lt;h3&gt;
  
  
  Cardinality Matters
&lt;/h3&gt;

&lt;p&gt;High-cardinality attributes (user IDs, request IDs) on metrics cause a combinatorial explosion of time series, and storage costs grow with them. We learned to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use high-cardinality attributes only on traces&lt;/li&gt;
&lt;li&gt;Keep metric dimensions bounded (status codes, service names, regions)&lt;/li&gt;
&lt;li&gt;Use exemplars to link metrics to representative traces&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Context Propagation is Tricky
&lt;/h3&gt;

&lt;p&gt;Traces only work if context propagates correctly. We encountered issues with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async boundaries&lt;/strong&gt;: Ensure activity context flows to background tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message queues&lt;/strong&gt;: Propagate trace context in message headers (see the sketch below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language services&lt;/strong&gt;: Use W3C Trace Context format for compatibility&lt;/li&gt;
&lt;/ul&gt;
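
&lt;p&gt;For the message-queue case, the W3C trace context has to travel inside the message itself. A minimal C# sketch of injecting it into a header dictionary on the publish side; the consumer extracts it with the same propagator, and the header shape is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Context.Propagation;

public static class MessagePublisher
{
    // Copy the current trace context (and baggage) into message headers so the
    // consumer can continue the same distributed trace.
    public static void InjectTraceContext(IDictionary&amp;lt;string, string&amp;gt; headers)
    {
        var context = new PropagationContext(
            Activity.Current?.Context ?? default,
            Baggage.Current);

        Propagators.DefaultTextMapPropagator.Inject(
            context,
            headers,
            (carrier, key, value) =&amp;gt; carrier[key] = value);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;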

&lt;h3&gt;
  
  
  Start with Auto-Instrumentation
&lt;/h3&gt;

&lt;p&gt;Don't try to instrument everything manually. Start with auto-instrumentation libraries for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP servers and clients&lt;/li&gt;
&lt;li&gt;Database clients&lt;/li&gt;
&lt;li&gt;Message queue clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add custom instrumentation incrementally for business-specific visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After implementing OpenTelemetry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean time to detection&lt;/strong&gt;: Reduced by 50% with correlated traces and logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-service debugging&lt;/strong&gt;: Single trace view shows entire request flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend flexibility&lt;/strong&gt;: Successfully tested migration to alternative backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility&lt;/strong&gt;: Metrics show resource consumption per feature&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most valuable outcome: when incidents occur, engineers start with a trace, not a sea of logs. Root cause identification that used to take hours now takes minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry requires upfront investment—SDK configuration, Collector deployment, team education. But the payoff is substantial: unified observability that's not locked to any vendor.&lt;/p&gt;

&lt;p&gt;If you're starting fresh, OpenTelemetry is the clear choice. If you're migrating from a proprietary solution, start with new services and gradually expand. The ecosystem is mature enough for production use, and the community is only growing.&lt;/p&gt;

&lt;p&gt;The future of observability is open standards. OpenTelemetry is that standard.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>otel</category>
      <category>datadog</category>
    </item>
    <item>
      <title>Migrating from Community ingress-nginx to F5 NGINX Ingress Controller Across 3 AKS Clusters</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:09:34 +0000</pubDate>
      <link>https://dev.to/clouddevang/migrating-from-community-ingress-nginx-to-f5-nginx-ingress-controller-across-3-aks-clusters-5g2h</link>
      <guid>https://dev.to/clouddevang/migrating-from-community-ingress-nginx-to-f5-nginx-ingress-controller-across-3-aks-clusters-5g2h</guid>
      <description>&lt;p&gt;Earlier this month I migrated three production AKS clusters off the community &lt;code&gt;ingress-nginx&lt;/code&gt; controller and onto the F5 NGINX Ingress Controller OSS (v2.5.1). The three workloads were a compliance API service, a real-time WebSocket trading server, and a charting frontend. Same controller name, completely different internals — and enough sharp edges to fill a post.&lt;/p&gt;

&lt;p&gt;This is the full account: what changed, what broke, and the patterns I standardised across all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Migrate
&lt;/h2&gt;

&lt;p&gt;The community Helm chart (&lt;code&gt;kubernetes/ingress-nginx&lt;/code&gt;) and the F5 chart (&lt;code&gt;nginx-stable/nginx-ingress&lt;/code&gt;) both proxy traffic through NGINX, but they diverge at almost every other layer — Helm structure, annotation prefixes, config key names, metrics port, and label selectors. F5 NGINX IC is the upstream-maintained version aligned with NGINX OSS releases and gives tighter control over the NGINX config without relying on the community's annotation translation layer.&lt;/p&gt;

&lt;p&gt;The practical trigger was a mix of factors: the community chart had accumulated workarounds for bugs we no longer needed, the annotation surface was getting hard to audit, and we wanted a single, consistent ingress stack across clusters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Stayed the Same
&lt;/h2&gt;

&lt;p&gt;Before diving into the diffs, here is what did &lt;strong&gt;not&lt;/strong&gt; change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IngressClass name remains &lt;code&gt;nginx&lt;/code&gt; in every cluster (no application-level changes needed)&lt;/li&gt;
&lt;li&gt;Azure Load Balancer type (internal where it was internal, public where public)&lt;/li&gt;
&lt;li&gt;cert-manager ClusterIssuers (one field rename, covered below)&lt;/li&gt;
&lt;li&gt;Linkerd injection on controller pods&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Migration Playbook
&lt;/h2&gt;

&lt;p&gt;Every cluster followed the same five-step pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Pull the F5 chart via OCI — no helm repo add needed&lt;/span&gt;
helm pull oci://ghcr.io/nginx/charts/nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 2.5.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--destination&lt;/span&gt; /tmp/charts/

&lt;span class="c"&gt;# 2. Verify checksum before touching anything&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"23c866c0531719586570435a4d9a57ac0fb9661fdafd572c8916208cb7b4f225  /tmp/charts/nginx-ingress-2.5.1.tgz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;--check&lt;/span&gt;

&lt;span class="c"&gt;# 3. One-time IngressClass migration guard&lt;/span&gt;
&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get ingressclass nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.controller}'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"k8s.io/ingress-nginx"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removing community IngressClass — allowing F5 takeover"&lt;/span&gt;
  kubectl delete ingressclass nginx
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 4. Helm upgrade&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; nginx-ingress /tmp/charts/nginx-ingress-2.5.1.tgz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 5m

&lt;span class="c"&gt;# 5. Verify the right controller is running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;nginx-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; nginx-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3 deserves its own section.&lt;/p&gt;




&lt;h2&gt;
  
  
  The IngressClass Immutability Trap
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;spec.controller&lt;/code&gt; on an IngressClass resource is &lt;strong&gt;immutable&lt;/strong&gt; after creation. The community controller sets it to &lt;code&gt;k8s.io/ingress-nginx&lt;/code&gt;; the F5 controller expects &lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;. If you just run &lt;code&gt;helm upgrade&lt;/code&gt;, F5 will fail to adopt the existing IngressClass and create a conflicting one — or worse, silently ignore it and not process any Ingress resources.&lt;/p&gt;

&lt;p&gt;The solution is to delete the IngressClass before the first F5 install. But a naive unconditional delete is dangerous in an idempotent pipeline — if someone reruns the pipeline after migration, they'd delete the already-correct F5-owned IngressClass mid-flight, causing a brief outage.&lt;/p&gt;

&lt;p&gt;The guard condition solves this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"k8s.io/ingress-nginx"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;kubectl delete ingressclass nginx
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first successful F5 install, &lt;code&gt;spec.controller&lt;/code&gt; reads &lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;, so every subsequent pipeline run skips the delete. One-time, idempotent, safe.&lt;/p&gt;
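
&lt;p&gt;To confirm the takeover, and that future pipeline runs will skip the delete, read the field back; for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# After the first F5 install this should print the F5 controller string,
# which is exactly the condition that makes the step-3 guard a no-op on reruns.
kubectl get ingressclass nginx -o jsonpath='{.spec.controller}'
# nginx.org/ingress-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;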




&lt;h2&gt;
  
  
  Helm Values: Structural Differences
&lt;/h2&gt;

&lt;p&gt;The community chart uses a flat &lt;code&gt;controller.config&lt;/code&gt; map. F5 nests everything under &lt;code&gt;controller.config.entries&lt;/code&gt;. Small diff, big gotcha if you copy-paste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;load-balance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ewma"&lt;/span&gt;
    &lt;span class="na"&gt;use-gzip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;F5:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600s"&lt;/span&gt;   &lt;span class="c1"&gt;# note: F5 expects the unit suffix&lt;/span&gt;
      &lt;span class="na"&gt;lb-method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ewma"&lt;/span&gt;            &lt;span class="c1"&gt;# key renamed&lt;/span&gt;
      &lt;span class="c1"&gt;# use-gzip has no equivalent — moved to http-snippets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A number of community config keys simply do not exist in F5 and are silently ignored if you leave them in. I audited every key against the &lt;a href="https://docs.nginx.com/nginx-ingress-controller/" rel="noopener noreferrer"&gt;F5 config documentation&lt;/a&gt; and removed: &lt;code&gt;allow-snippet-annotations&lt;/code&gt;, &lt;code&gt;allow-backend-server-header&lt;/code&gt;, &lt;code&gt;block-user-agents&lt;/code&gt;, &lt;code&gt;enable-vts-status&lt;/code&gt;, &lt;code&gt;generate-request-id&lt;/code&gt;, &lt;code&gt;limit-req-status-code&lt;/code&gt;, &lt;code&gt;use-forwarded-headers&lt;/code&gt;, &lt;code&gt;use-geoip&lt;/code&gt;, &lt;code&gt;upstream-keepalive-*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Other keys that F5 &lt;strong&gt;does&lt;/strong&gt; support but with different names:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community key&lt;/th&gt;
&lt;th&gt;F5 equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load-balance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;lb-method&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;proxy-read-timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;proxy-read-timeout&lt;/code&gt; + unit suffix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;client-header-timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to &lt;code&gt;http-snippets&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
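
&lt;p&gt;For the keys that move to &lt;code&gt;http-snippets&lt;/code&gt;, the raw NGINX directive goes into the shared HTTP block. A minimal sketch for &lt;code&gt;client-header-timeout&lt;/code&gt;, written as a small values overlay (the 30s value and the file name are illustrative, not the production settings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Overlay that carries the old client-header-timeout over as a raw directive;
# pass it alongside the main values file on helm upgrade with an extra -f flag.
cat &lt;&lt;'EOF' &gt; values-http-snippets.yaml
controller:
  config:
    entries:
      http-snippets: |
        client_header_timeout 30s;
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
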

&lt;p&gt;The full base controller config across all three clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment&lt;/span&gt;
  &lt;span class="na"&gt;enableCustomResources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;      &lt;span class="c1"&gt;# not using VirtualServer CRDs&lt;/span&gt;
  &lt;span class="na"&gt;enableSnippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;telemetryReporting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;                   &lt;span class="c1"&gt;# no outbound access to oss.edge.df.f5.com&lt;/span&gt;

  &lt;span class="na"&gt;ingressClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;setAsDefaultIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;

  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9113&lt;/span&gt;                      &lt;span class="c1"&gt;# changed from community's default&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three settings here tripped things up before I got them right:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;telemetryReporting.enable: false&lt;/code&gt;&lt;/strong&gt; — F5 attempts to phone home to &lt;code&gt;oss.edge.df.f5.com&lt;/code&gt;. In a cluster with no outbound internet on the node pool, this causes the controller pod to crash-loop on startup waiting for the connection to time out. Must be disabled explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;enableCustomResources: false&lt;/code&gt;&lt;/strong&gt; — F5 ships its own CRDs (VirtualServer, TransportServer, Policy). If you leave this enabled and those CRDs aren't pre-installed, the controller crashes. Since all three clusters use standard Kubernetes Ingress resources, I disabled them entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure LB health probe&lt;/strong&gt; — The community controller serves &lt;code&gt;/healthz&lt;/code&gt; on port 80. F5 does not. Azure's default HTTP probe on that path will mark all backends unhealthy, so switch the probe to TCP, which is what the &lt;code&gt;azure-load-balancer-health-probe-protocol: tcp&lt;/code&gt; service annotation above does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rate Limiting: From Annotations to NGINX Snippets
&lt;/h2&gt;

&lt;p&gt;Community ingress-nginx ships first-class annotations for rate limiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# community — applied as ingress annotations&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-req-rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;120r/m"&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-conn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-req-status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;429"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 NGINX IC does not have equivalent annotation primitives. The correct F5 approach is to declare the rate limit zones globally in &lt;code&gt;http-snippets&lt;/code&gt; (controller values) and apply them per-ingress via &lt;code&gt;server-snippets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controller values — shared zones:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http-snippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;geo $app_limit_bypass {&lt;/span&gt;
          &lt;span class="s"&gt;default 0;&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt;office-cidr-1&amp;gt; 1;&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt;office-cidr-2&amp;gt; 1;&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;map $app_limit_bypass $app_limit_key {&lt;/span&gt;
          &lt;span class="s"&gt;0 $binary_remote_addr;&lt;/span&gt;
          &lt;span class="s"&gt;1 "";&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;limit_req_zone  $app_limit_key zone=app_rpm:10m rate=120r/m;&lt;/span&gt;
        &lt;span class="s"&gt;limit_conn_zone $app_limit_key zone=app_conn:10m;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingress manifest — apply per route:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nginx.org/server-snippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;limit_req zone=app_rpm burst=80 nodelay;&lt;/span&gt;
    &lt;span class="s"&gt;limit_req_status 429;&lt;/span&gt;
    &lt;span class="s"&gt;limit_conn app_conn 60;&lt;/span&gt;
    &lt;span class="s"&gt;limit_conn_status 429;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The geo+map pattern lets specific IP ranges (office networks, CI runners, load testing hosts) bypass rate limits by mapping to an empty key — which &lt;code&gt;limit_req_zone&lt;/code&gt; treats as unlimited. This is cleaner than maintaining allow-lists in multiple annotation blocks across ingress manifests.&lt;/p&gt;
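
&lt;p&gt;To confirm the zones actually bite after cutover, a quick smoke test from a non-bypassed address is enough. A sketch (the hostname, endpoint, and request count are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hammer one endpoint and tally the status codes: a healthy setup shows some
# 200s (within 120r/m plus burst=80) followed by 429s once the bucket empties.
for i in $(seq 1 250); do
  curl -s -o /dev/null -w "%{http_code}\n" https://app.example.com/api/ping
done | sort | uniq -c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;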




&lt;h2&gt;
  
  
  WebSocket Service: Keepalive Surprises
&lt;/h2&gt;

&lt;p&gt;One of the services is a Socket.io server behind WebSocket connections. Everything looked healthy post-migration — pods up, ingress adopted — but Socket.io clients started disconnecting every 30–60 seconds.&lt;/p&gt;

&lt;p&gt;The root cause: F5's default &lt;code&gt;keepalive-timeout&lt;/code&gt; is &lt;code&gt;0s&lt;/code&gt; (disabled), whereas the community chart defaults to &lt;code&gt;60s&lt;/code&gt;. Long-lived WebSocket connections proxied through NGINX rely on that keepalive window to survive idle periods, and with it disabled, NGINX was closing idle connections server-side.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;keepalive-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60s"&lt;/span&gt;
      &lt;span class="na"&gt;http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP/2 and WebSocket upgrades conflict; disable explicitly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also required adding the F5 WebSocket annotation to the ingress manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nginx.org/websocket-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-websocket-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this annotation, F5 does not set the necessary &lt;code&gt;Upgrade&lt;/code&gt; and &lt;code&gt;Connection&lt;/code&gt; proxy headers for WebSocket handshakes. The community controller handled this automatically; F5 requires you to be explicit.&lt;/p&gt;
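
&lt;p&gt;A quick way to confirm the annotation took effect is to dump the rendered NGINX config from the controller pod and look for the upgrade headers; a sketch using the same label selector as the playbook above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Grab one controller pod and dump the full rendered config with nginx -T,
# then check that the WebSocket upgrade headers are present for the service.
POD=$(kubectl get pods -n nginx-ingress -l app.kubernetes.io/name=nginx-ingress \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n nginx-ingress "$POD" -- nginx -T 2&gt;/dev/null | grep -i -A1 'proxy_set_header Upgrade'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;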




&lt;h2&gt;
  
  
  Zero-Downtime Service Selector Patch
&lt;/h2&gt;

&lt;p&gt;One cluster runs a secondary Service that routes specific traffic, and its label selector was hardcoded to the community controller labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;app.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
&lt;span class="err"&gt;app.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;component&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 uses &lt;code&gt;app.kubernetes.io/name=nginx-ingress&lt;/code&gt;. After migration, the service selector matched nothing — endpoints went empty, traffic dropped.&lt;/p&gt;

&lt;p&gt;Re-applying the existing Service manifest won't fix this, because that manifest still carries the stale selector. Instead, I patched the selector as a pre-upgrade pipeline step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl patch service &amp;lt;legacy-service-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{
    "spec": {
      "selector": {
        "app.kubernetes.io/name": "nginx-ingress"
      }
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--type='merge'&lt;/code&gt; strategy replaces only the specified keys, leaving the rest of the selector intact. Running this before &lt;code&gt;helm upgrade&lt;/code&gt; means the service selector matches the new pods the moment they come up.&lt;/p&gt;

&lt;p&gt;The broader lesson: grep for &lt;code&gt;ingress-nginx&lt;/code&gt; in &lt;strong&gt;all&lt;/strong&gt; Service selectors across your cluster before starting the migration. Any service with a hardcoded community label selector will silently drop traffic after cutover.&lt;/p&gt;
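
&lt;p&gt;A sketch of that grep, done against the live cluster rather than the manifests (assumes &lt;code&gt;jq&lt;/code&gt; is available):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List every Service whose selector values still mention the community labels;
# anything printed here needs a selector patch before cutover.
kubectl get svc --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.selector != null)
      | select([.spec.selector[]] | any(contains("ingress-nginx")))
      | "\(.metadata.namespace)/\(.metadata.name)"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;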




&lt;h2&gt;
  
  
  cert-manager
&lt;/h2&gt;

&lt;p&gt;One field rename in the ClusterIssuer template — &lt;code&gt;class&lt;/code&gt; is deprecated in favour of &lt;code&gt;ingressClassName&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# before&lt;/span&gt;
&lt;span class="na"&gt;solvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;

&lt;span class="c1"&gt;# after&lt;/span&gt;
&lt;span class="na"&gt;solvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also removed a cert-manager feature gate that was only needed to work around a community ingress-nginx bug (issue #11176) related to path type handling. F5 does not have the bug:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# removed from cert-manager values&lt;/span&gt;
&lt;span class="na"&gt;featureGates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACMEHTTP01IngressPathTypeExact=false"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Datadog Metrics
&lt;/h2&gt;

&lt;p&gt;F5 exposes Prometheus metrics on port &lt;code&gt;9113&lt;/code&gt; (the community controller used &lt;code&gt;8080&lt;/code&gt;). The existing Datadog auto-discovery config was pointing at the wrong port. I added an OpenMetrics check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# datadog-agent values.yaml&lt;/span&gt;
&lt;span class="na"&gt;confd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openmetrics.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;ad_identifiers:&lt;/span&gt;
      &lt;span class="s"&gt;- nginx-ingress&lt;/span&gt;
    &lt;span class="s"&gt;init_config:&lt;/span&gt;
    &lt;span class="s"&gt;instances:&lt;/span&gt;
      &lt;span class="s"&gt;- openmetrics_endpoint: "http://%%host%%:9113/metrics"&lt;/span&gt;
        &lt;span class="s"&gt;namespace: nginx_ingress&lt;/span&gt;
        &lt;span class="s"&gt;metrics:&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_accepted&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_active&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_handled&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_http_requests_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_ingress_resources_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_reloads_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_reload_errors_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_last_reload_milliseconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to watch: the file must be named &lt;code&gt;openmetrics.yaml&lt;/code&gt; (not &lt;code&gt;nginx-ingress.yaml&lt;/code&gt;) for Datadog's catalog to recognise it, and &lt;code&gt;ad_identifiers&lt;/code&gt; must match the container name &lt;code&gt;nginx-ingress&lt;/code&gt; exactly.&lt;/p&gt;
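
&lt;p&gt;To check the pickup, exec into one of the node agents and inspect its status output. The namespace, label, and container name below assume a stock datadog-agent Helm install, so adjust them to your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Find one agent pod and confirm the openmetrics check is running and
# scraping the controller on port 9113.
AGENT=$(kubectl get pods -n datadog -l app=datadog -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n datadog "$AGENT" -c agent -- agent status | grep -i -A8 openmetrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;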




&lt;h2&gt;
  
  
  Node Selector Key Update
&lt;/h2&gt;

&lt;p&gt;The community chart uses the deprecated node label key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;beta.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;os&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 values use the stable GA key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;os&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Newer AKS node images no longer carry &lt;code&gt;beta.kubernetes.io/os&lt;/code&gt;. If your node pool has dropped it, community controller pods won't schedule. Not migration-specific, but worth cleaning up in the same PR.&lt;/p&gt;
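
&lt;p&gt;A one-liner makes the drift visible per node; an empty column under the deprecated key means the old selector would no longer match that node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show both the deprecated beta key and the GA key side by side for every node.
kubectl get nodes -L beta.kubernetes.io/os -L kubernetes.io/os
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;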




&lt;h2&gt;
  
  
  Helm Upgrade Stability
&lt;/h2&gt;

&lt;p&gt;On cold nodes (a newly scaled-up node pool), the F5 controller image pull can take longer than the three minutes the pipeline previously allowed for the release. An explicit &lt;code&gt;--wait --timeout 5m&lt;/code&gt; prevents spurious failures that looked like deployment regressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; nginx-ingress ./nginx-ingress-2.5.1.tgz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 5m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Rollout Issues Timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;F5 crash-loops on startup&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;telemetryReporting.enable: false&lt;/code&gt; + &lt;code&gt;enableCustomResources: false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Linkerd not injecting controller pods&lt;/td&gt;
&lt;td&gt;Fixed annotation path: &lt;code&gt;podAnnotations&lt;/code&gt; → &lt;code&gt;controller.pod.annotations&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Datadog scraping wrong port&lt;/td&gt;
&lt;td&gt;Added OpenMetrics check on port 9113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Datadog system-probe seccomp failures&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;systemProbe.enabled: false&lt;/code&gt;, &lt;code&gt;discovery.enabled: false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+1h&lt;/td&gt;
&lt;td&gt;All LB backends unhealthy&lt;/td&gt;
&lt;td&gt;Switched Azure LB probe from HTTP &lt;code&gt;/healthz&lt;/code&gt; to TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+2h&lt;/td&gt;
&lt;td&gt;Socket.io client disconnections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;keepalive-timeout: 60s&lt;/code&gt;, &lt;code&gt;nginx.org/websocket-services&lt;/code&gt; annotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+3h&lt;/td&gt;
&lt;td&gt;Secondary service endpoints empty&lt;/td&gt;
&lt;td&gt;Pre-upgrade service selector patch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+24h&lt;/td&gt;
&lt;td&gt;Helm timeout on cold nodes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--wait --timeout 5m&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+10d&lt;/td&gt;
&lt;td&gt;IngressClass delete too aggressive in pipeline reruns&lt;/td&gt;
&lt;td&gt;Made delete conditional on &lt;code&gt;spec.controller&lt;/code&gt; value&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The conditional IngressClass delete came last because the unconditional delete worked fine on the first run — the rerun risk only became apparent during a pipeline review afterward.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Differences Cheat Sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Community ingress-nginx&lt;/th&gt;
&lt;th&gt;F5 NGINX IC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Helm source&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes.github.io/ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCI: &lt;code&gt;ghcr.io/nginx/charts/nginx-ingress&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chart name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx-ingress&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config structure&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;controller.config&lt;/code&gt; flat map&lt;/td&gt;
&lt;td&gt;&lt;code&gt;controller.config.entries&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Annotations (&lt;code&gt;nginx.ingress.kubernetes.io/*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;http-snippets&lt;/code&gt; + &lt;code&gt;server-snippets&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nginx.org/websocket-services&lt;/code&gt; required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics port&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;td&gt;9113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod labels&lt;/td&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/name=ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/name=nginx-ingress&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IngressClass controller field&lt;/td&gt;
&lt;td&gt;&lt;code&gt;k8s.io/ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linkerd annotation path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;podAnnotations&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;controller.pod.annotations&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node selector key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;beta.kubernetes.io/os&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes.io/os&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telemetry&lt;/td&gt;
&lt;td&gt;Off by default&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Must disable explicitly&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom resources&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Must disable if not using&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LB health probe&lt;/td&gt;
&lt;td&gt;HTTP &lt;code&gt;/healthz&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;TCP only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Audit every config key before migrating.&lt;/strong&gt; F5 silently ignores unknown config keys. A pre-migration diff against the F5 config reference would have caught the &lt;code&gt;upstream-keepalive-*&lt;/code&gt; and &lt;code&gt;use-gzip&lt;/code&gt; removals before they hit production.&lt;/p&gt;
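
&lt;p&gt;The audit itself can be mostly mechanical: dump the keys currently set and walk the list against the F5 ConfigMap reference. A sketch, assuming mikefarah &lt;code&gt;yq&lt;/code&gt; v4 (the file name is a placeholder for the old community values file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print every key set under controller.config in the community values file,
# then check each one for an F5 equivalent, a rename, or an http-snippets move.
yq '.controller.config | keys' community-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;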

&lt;p&gt;&lt;strong&gt;Test WebSocket apps on a staging cluster first.&lt;/strong&gt; The keepalive timeout issue was predictable — the default changed between controllers and I didn't check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grep for &lt;code&gt;ingress-nginx&lt;/code&gt; in all Service selectors before starting.&lt;/strong&gt; Any hardcoded community label selector silently drops traffic after cutover. Add the selector patch to your playbook as a standard pre-upgrade step, not a reactive fix.&lt;/p&gt;




&lt;p&gt;The migration is complete and stable across all three clusters. Ingress configurations are now easier to reason about — NGINX config is NGINX config, not a translation layer of annotations into &lt;code&gt;nginx.conf&lt;/code&gt; directives you can't see. If you're running the community chart and considering the switch, the above should give you a realistic picture of what to budget for.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>aks</category>
      <category>devops</category>
    </item>
    <item>
      <title>KEDA vs Azure Functions: Choosing the Right Autoscaler for Bursty Workloads</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:09:34 +0000</pubDate>
      <link>https://dev.to/clouddevang/keda-vs-azure-functions-choosing-the-right-autoscaler-for-bursty-workloads-249i</link>
      <guid>https://dev.to/clouddevang/keda-vs-azure-functions-choosing-the-right-autoscaler-for-bursty-workloads-249i</guid>
      <description>&lt;p&gt;When we needed to process millions of events from Azure Service Bus, the obvious choice seemed to be Azure Functions. Serverless, event-driven, automatic scaling—what's not to love? But after months of production experience, we migrated to Azure Container Apps with KEDA. Here's why, and when you might want to make the same choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case: Bursty Event Processing
&lt;/h2&gt;

&lt;p&gt;Our system processed financial transactions from a message queue. The traffic pattern was extremely bursty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Off-peak&lt;/strong&gt;: 10-50 messages per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak&lt;/strong&gt;: 10,000+ messages per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ramp time&lt;/strong&gt;: Bursts arrive within seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Azure Functions' scale controller is designed for this pattern. It monitors queue depth and scales out workers automatically. In theory, perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems We Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cold Start Latency
&lt;/h3&gt;

&lt;p&gt;Azure Functions (Consumption plan) exhibited cold start times of 5-10 seconds for our .NET 6 application. During sudden bursts, the queue would accumulate thousands of messages before enough instances were warm.&lt;/p&gt;

&lt;p&gt;We tried the Premium plan, which keeps pre-warmed instances ready. This helped, but at significant cost—we were paying for idle compute 24/7.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scaling Granularity
&lt;/h3&gt;

&lt;p&gt;The Azure Functions scale controller makes decisions based on aggregate metrics. For Service Bus, it examines message count and age. But the scaling algorithm is opaque, and we had limited control over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale-out threshold&lt;/strong&gt;: How many messages trigger a new instance?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-in behavior&lt;/strong&gt;: How quickly do instances terminate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum instances&lt;/strong&gt;: Hard limits that required support tickets to raise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed finer control to optimize for our specific latency requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Instance Limits
&lt;/h3&gt;

&lt;p&gt;Our function sometimes needed 50+ concurrent instances to process bursts. Azure Functions has per-app limits that required special configuration. More importantly, rapid scaling caused resource contention in the underlying infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter KEDA on Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;KEDA (Kubernetes Event-driven Autoscaling) provides the same event-driven scaling but with explicit, configurable rules. Azure Container Apps integrates KEDA natively, giving us serverless simplicity with Kubernetes-level control.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Migration
&lt;/h3&gt;

&lt;p&gt;Moving from Azure Functions to Container Apps required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Containerizing the application&lt;/strong&gt;: Our function code became a container image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring KEDA scalers&lt;/strong&gt;: Explicit rules for Service Bus scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setting up Container Apps&lt;/strong&gt;: Managed Kubernetes without the management overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's our KEDA configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-bus-scaler&lt;/span&gt;
      &lt;span class="na"&gt;custom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-servicebus&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;queueName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transactions&lt;/span&gt;
          &lt;span class="na"&gt;messageCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50"&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;our-namespace&lt;/span&gt;
        &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;servicebus-connection&lt;/span&gt;
            &lt;span class="na"&gt;triggerParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key differences from Azure Functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit message threshold&lt;/strong&gt;: &lt;code&gt;messageCount: "50"&lt;/code&gt; targets roughly 50 queued messages per replica, and the value is fully ours to tune&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum replicas&lt;/strong&gt;: Always keep 2 instances warm (no cold starts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum replicas&lt;/strong&gt;: Set exactly what we need, no support tickets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;p&gt;We ran identical workloads on both platforms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Azure Functions&lt;/th&gt;
&lt;th&gt;Container Apps + KEDA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start (p95)&lt;/td&gt;
&lt;td&gt;8.2 seconds&lt;/td&gt;
&lt;td&gt;0 (always warm)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale-out time&lt;/td&gt;
&lt;td&gt;15-30 seconds&lt;/td&gt;
&lt;td&gt;5-10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost (monthly)&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;td&gt;$1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max throughput&lt;/td&gt;
&lt;td&gt;8,000 msg/sec&lt;/td&gt;
&lt;td&gt;15,000 msg/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost reduction came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More efficient bin-packing of containers&lt;/li&gt;
&lt;li&gt;No Premium plan pre-warm charges&lt;/li&gt;
&lt;li&gt;Faster scale-down during quiet periods&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Azure Functions
&lt;/h2&gt;

&lt;p&gt;Azure Functions still wins for certain scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Simple HTTP APIs
&lt;/h3&gt;

&lt;p&gt;For low-traffic APIs with occasional spikes, the Consumption plan's pay-per-execution model is unbeatable. Cold starts matter less for APIs where latency is measured in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Timer-Triggered Jobs
&lt;/h3&gt;

&lt;p&gt;Scheduled tasks that run once per hour don't need warm instances. Azure Functions' timer trigger is simpler to configure than a CronJob equivalent.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rapid Prototyping
&lt;/h3&gt;

&lt;p&gt;When you need to deploy something quickly, Azure Functions' binding system is incredibly productive. Input/output bindings for Blob Storage, Cosmos DB, and Service Bus require minimal code.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Teams Without Container Experience
&lt;/h3&gt;

&lt;p&gt;Not every team has container expertise. Azure Functions abstracts away the infrastructure entirely, which is valuable for teams focused on business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose KEDA + Container Apps
&lt;/h2&gt;

&lt;p&gt;Choose Container Apps with KEDA when:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. You Need Predictable Cold Starts
&lt;/h3&gt;

&lt;p&gt;If your SLA requires sub-second latency, keeping minimum replicas warm is essential. KEDA makes this configuration explicit.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Have Complex Scaling Requirements
&lt;/h3&gt;

&lt;p&gt;Multiple triggers, custom metrics, or specific threshold values require KEDA's flexibility. The scaling rules are transparent and version-controlled.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Your Workload is Container-Native
&lt;/h3&gt;

&lt;p&gt;If you're already building containers for other environments (local development, other clouds), Container Apps provides consistency without Kubernetes complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost Optimization Matters
&lt;/h3&gt;

&lt;p&gt;For high-volume workloads, Container Apps' consumption-based billing often works out cheaper than Functions Premium. Run the numbers for your specific usage pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Approaches
&lt;/h2&gt;

&lt;p&gt;We actually use both in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Functions&lt;/strong&gt;: Internal tools, scheduled jobs, low-traffic APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Apps + KEDA&lt;/strong&gt;: High-volume event processing, latency-sensitive workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platforms aren't mutually exclusive. Choose based on the specific requirements of each workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Tips
&lt;/h2&gt;

&lt;p&gt;If you're migrating from Functions to Container Apps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with KEDA documentation&lt;/strong&gt;: Understanding the scalers is crucial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scaling behavior&lt;/strong&gt;: Use load testing to verify your configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor scale events&lt;/strong&gt;: Azure Monitor shows container instance counts over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set alerts on queue depth&lt;/strong&gt;: Catch scaling issues before they become outages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For Service Bus specifically, configure dead-letter queue monitoring. KEDA scales based on active messages, not dead letters.&lt;/p&gt;
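
&lt;p&gt;One lightweight way to watch dead letters is the Azure CLI; the queue and namespace names below match the scaler config above, while the resource group is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Dead-lettered messages don't drive KEDA scaling, so track them separately
# and alert when the count starts climbing.
az servicebus queue show \
  --resource-group my-rg \
  --namespace-name our-namespace \
  --name transactions \
  --query countDetails.deadLetterMessageCount \
  --output tsv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;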

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure Functions and KEDA solve similar problems with different tradeoffs. Functions optimizes for simplicity; KEDA optimizes for control. Neither is universally better.&lt;/p&gt;

&lt;p&gt;For our bursty, latency-sensitive workload, KEDA's explicit configuration and warm instance support delivered better performance at lower cost. Your workload might be different.&lt;/p&gt;

&lt;p&gt;The best approach? Prototype both. Azure makes it easy to try Container Apps alongside Functions. Let the metrics guide your decision.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>azure</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
