<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devang Goyal</title>
    <description>The latest articles on DEV Community by Devang Goyal (@clouddevang).</description>
    <link>https://dev.to/clouddevang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3934636%2Fa1a6768b-b33f-4a80-a6ee-49423ee429a5.png</url>
      <title>DEV Community: Devang Goyal</title>
      <link>https://dev.to/clouddevang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clouddevang"/>
    <language>en</language>
    <item>
      <title>Building Zero-Trust Infrastructure on Azure: A Production Story</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:44:55 +0000</pubDate>
      <link>https://dev.to/clouddevang/building-zero-trust-infrastructure-on-azure-a-production-story-1dee</link>
      <guid>https://dev.to/clouddevang/building-zero-trust-infrastructure-on-azure-a-production-story-1dee</guid>
      <description>&lt;p&gt;When I joined the platform team at a financial services company, I inherited an infrastructure that, while functional, had significant security gaps. APIs were exposed to the public internet, database connections traversed public networks, and secret management relied on application configuration files. This is the story of how we transformed that architecture into a true zero-trust environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Trust Boundaries Were Too Wide
&lt;/h2&gt;

&lt;p&gt;Our initial architecture followed a common anti-pattern: everything inside the "corporate network" was trusted. Azure App Services communicated with Azure SQL over public endpoints. Key Vault secrets were fetched using connection strings stored in app settings. Storage accounts accepted requests from any IP address.&lt;/p&gt;

&lt;p&gt;The reality of modern cloud architecture is that &lt;strong&gt;there is no perimeter&lt;/strong&gt;. Zero-trust treats every request, whether internal or external, as untrusted until it is authenticated and authorized. Our infrastructure violated this principle at multiple levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Redesign
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VNet Integration for All Compute
&lt;/h3&gt;

&lt;p&gt;The first major change was enabling VNet integration for every compute resource. Azure App Services, Azure Functions, and Azure Container Apps were all connected to a dedicated virtual network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VNet Architecture:
├── Management Subnet (10.0.1.0/24)
│   └── Jumpbox, Bastion
├── App Subnet (10.0.2.0/24)
│   └── App Services, Functions
├── Container Subnet (10.0.3.0/24)
│   └── Container Apps
└── Data Subnet (10.0.4.0/24)
    └── Private Endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With VNet integration, outbound traffic from our applications now routes through the virtual network, allowing us to control egress through Network Security Groups and route tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Private Endpoints for Data Services
&lt;/h3&gt;

&lt;p&gt;The most critical change was eliminating public endpoints for all data services. Azure SQL, Key Vault, Storage Accounts, and Service Bus were all configured with private endpoints.&lt;/p&gt;

&lt;p&gt;Private endpoints create a network interface inside your VNet with a private IP address. When your application connects to &lt;code&gt;yourdb.database.windows.net&lt;/code&gt;, DNS resolution returns the private IP (e.g., &lt;code&gt;10.0.4.10&lt;/code&gt;) instead of the public IP.&lt;/p&gt;

&lt;p&gt;This required careful DNS configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private DNS Zones&lt;/strong&gt;: We created private DNS zones for each service type (&lt;code&gt;privatelink.database.windows.net&lt;/code&gt;, &lt;code&gt;privatelink.vaultcore.azure.net&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VNet Links&lt;/strong&gt;: Each private DNS zone was linked to our VNet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record Management&lt;/strong&gt;: Private endpoints automatically register A records in these zones&lt;/li&gt;
&lt;/ul&gt;
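
&lt;p&gt;A quick way to verify this behaviour is to resolve the hostname from inside the VNet and again from outside it. A minimal C# sketch, assuming a small .NET console app and the placeholder hostname used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Net;

// Inside the VNet this should print the private endpoint IP (e.g. 10.0.4.10);
// from outside it prints the public IP, which the service firewall rejects anyway.
var addresses = await Dns.GetHostAddressesAsync("yourdb.database.windows.net");

foreach (var address in addresses)
{
    Console.WriteLine(address);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;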

&lt;p&gt;The result: &lt;strong&gt;zero public database exposure&lt;/strong&gt;. Even if an attacker compromised our application, they couldn't exfiltrate data over the internet because our SQL Server doesn't have a public IP.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. RBAC-Enforced Key Vault Access
&lt;/h3&gt;

&lt;p&gt;Instead of connection strings, we moved to managed identity authentication with RBAC. Each application is given a system-assigned managed identity, and Key Vault access is granted through role assignments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Old approach - connection string&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SecretClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vaultUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// New approach - same code, but identity is VNet-integrated&lt;/span&gt;
&lt;span class="c1"&gt;// and Key Vault only accepts requests from our VNet&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SecretClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vaultUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code didn't change, but the security posture did. Key Vault now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rejects requests from public internet&lt;/li&gt;
&lt;li&gt;Only accepts requests from our VNet via private endpoint&lt;/li&gt;
&lt;li&gt;Requires managed identity authentication (no secrets to manage)&lt;/li&gt;
&lt;li&gt;Enforces RBAC permissions (least-privilege access)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Service Endpoints for Azure SQL
&lt;/h3&gt;

&lt;p&gt;While private endpoints are ideal for most scenarios, we also used service endpoints for Azure SQL to provide defense in depth. Service endpoints route traffic through Azure's backbone network while allowing firewall rules at the SQL Server level.&lt;/p&gt;

&lt;p&gt;Our SQL Server firewall configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public network access&lt;/strong&gt;: Disabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual network rules&lt;/strong&gt;: Allow traffic from app subnet only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private endpoint&lt;/strong&gt;: Primary access method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means even if someone obtained valid credentials, they couldn't connect from outside our VNet.&lt;/p&gt;
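
&lt;p&gt;On the application side, the database connection itself also became passwordless. A minimal sketch with &lt;code&gt;Microsoft.Data.SqlClient&lt;/code&gt; (3.0 or later); the server and database names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Microsoft.Data.SqlClient;

// "Active Directory Default" acquires a token through the same credential chain
// as DefaultAzureCredential, so no password ever appears in configuration.
var connectionString =
    "Server=yourdb.database.windows.net;" +
    "Database=payments;" +
    "Authentication=Active Directory Default;" +
    "Encrypt=True;";

using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();
Console.WriteLine($"Connected to {connection.Database}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;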

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DNS is Everything
&lt;/h3&gt;

&lt;p&gt;The most challenging aspect wasn't the security configuration—it was DNS. When you enable private endpoints, you need to ensure that DNS resolution works correctly both from within Azure and from developer workstations.&lt;/p&gt;

&lt;p&gt;We implemented split-brain DNS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inside the VNet: private DNS zones return private endpoint IPs&lt;/li&gt;
&lt;li&gt;Outside the VNet: public DNS returns the public IP, and the service firewall rejects those connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For local development, developers connect via VPN, so their DNS queries are answered from inside the VNet and resolve to the private endpoint IPs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Identity Adoption Takes Time
&lt;/h3&gt;

&lt;p&gt;Moving from connection strings to managed identity required updating every application. Some third-party libraries didn't support managed identity initially, requiring workarounds or upgrades.&lt;/p&gt;

&lt;p&gt;The key was implementing changes incrementally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enable managed identity on the resource&lt;/li&gt;
&lt;li&gt;Grant RBAC permissions&lt;/li&gt;
&lt;li&gt;Update application code to use &lt;code&gt;DefaultAzureCredential&lt;/code&gt; (see the sketch below)
&lt;/li&gt;
&lt;li&gt;Remove the old connection string&lt;/li&gt;
&lt;li&gt;Verify with monitoring&lt;/li&gt;
&lt;/ol&gt;
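
&lt;p&gt;For step 3, the change in a .NET service is usually a one-line swap to &lt;code&gt;DefaultAzureCredential&lt;/code&gt;. A minimal sketch; the vault URL and secret name are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// DefaultAzureCredential tries environment variables, managed identity and
// developer sign-in in turn, so the same code runs locally and in Azure.
var client = new SecretClient(
    new Uri("https://your-vault.vault.azure.net/"),
    new DefaultAzureCredential());

KeyVaultSecret secret = await client.GetSecretAsync("Sql-AdminPassword");
Console.WriteLine($"Fetched secret version {secret.Properties.Version}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;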

&lt;h3&gt;
  
  
  Cost Considerations
&lt;/h3&gt;

&lt;p&gt;Private endpoints aren't free. Each private endpoint incurs a small hourly cost plus data processing charges. For a large deployment with many endpoints, this adds up.&lt;/p&gt;

&lt;p&gt;We optimized costs by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidating storage accounts where possible&lt;/li&gt;
&lt;li&gt;Using service endpoints as a complement (free)&lt;/li&gt;
&lt;li&gt;Implementing shared private endpoints for multi-region deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After implementing zero-trust architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero public database exposure&lt;/strong&gt;: All data services are private endpoint only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50% reduction in attack surface&lt;/strong&gt;: No public IPs on backend infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified secret management&lt;/strong&gt;: Managed identity eliminated most secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved compliance posture&lt;/strong&gt;: SOC 2 and PCI DSS audits became straightforward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important outcome wasn't technical—it was cultural. The team now defaults to private, authenticated, authorized communication for every new service. Zero-trust isn't a destination; it's a way of building systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building zero-trust infrastructure on Azure requires careful planning, especially around networking and DNS. But the security benefits are substantial. By eliminating implicit trust and enforcing authentication at every boundary, we've created an architecture that's resilient to both external attacks and internal compromise.&lt;/p&gt;

&lt;p&gt;If you're starting a similar journey, begin with VNet integration. Once your compute resources are in a VNet, private endpoints and RBAC become natural extensions. And remember: zero-trust is a principle, not a product. Every architecture decision should ask, "What happens if this is compromised?"&lt;/p&gt;

</description>
      <category>azure</category>
      <category>security</category>
      <category>sre</category>
    </item>
    <item>
      <title>SLOs, SLIs, and Error Budgets: A Practical Guide for SREs</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:15:45 +0000</pubDate>
      <link>https://dev.to/clouddevang/slos-slis-and-error-budgets-a-practical-guide-for-sres-5bmc</link>
      <guid>https://dev.to/clouddevang/slos-slis-and-error-budgets-a-practical-guide-for-sres-5bmc</guid>
      <description>&lt;p&gt;Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets form the foundation of Site Reliability Engineering. Yet many teams struggle to implement them effectively. This guide shares practical lessons from implementing SLO-based reliability practices in production financial systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the SRE Reliability Stack
&lt;/h2&gt;

&lt;p&gt;Before diving into implementation, let's clarify the hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLI (Service Level Indicator)&lt;/strong&gt;: A quantitative measure of service behavior (e.g., "99.2% of requests completed in under 200ms")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO (Service Level Objective)&lt;/strong&gt;: The target value for an SLI (e.g., "99.9% of requests should complete in under 200ms")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLA (Service Level Agreement)&lt;/strong&gt;: A contract with consequences for missing SLOs (e.g., "If we miss 99.9%, customers get credits")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Budget&lt;/strong&gt;: The allowed failure rate (e.g., "0.1% of requests can fail per month")&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing the Right SLIs
&lt;/h2&gt;

&lt;p&gt;The most common mistake teams make is tracking too many SLIs. Start with these four golden signals:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Availability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;availability = successful_requests / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an API, this might be: "Percentage of HTTP requests returning 2xx or expected 4xx status codes."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;latency_sli = requests_under_threshold / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track at multiple percentiles: p50 for typical experience, p99 for tail latency. For financial systems, we use p99.9.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Throughput
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;throughput = successful_requests_per_second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Critical for batch processing systems and data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Error Rate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error_rate = failed_requests / total_requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Distinguish between client errors (4xx) and server errors (5xx)—only count 5xx against your error budget.&lt;/p&gt;
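
&lt;p&gt;To make the distinction concrete, here is a small C# sketch that computes the availability and latency SLIs from a batch of request records, counting only 5xx as bad; the record type and threshold are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.Linq;

public record RequestRecord(int StatusCode, double DurationMs);

public static class SliCalculator
{
    // Availability: only server errors (5xx) count against the budget.
    public static double Availability(IReadOnlyList&amp;lt;RequestRecord&amp;gt; requests) =&amp;gt;
        requests.Count(r =&amp;gt; r.StatusCode &amp;lt; 500) / (double)requests.Count;

    // Latency SLI: fraction of requests finishing under the threshold.
    public static double LatencySli(IReadOnlyList&amp;lt;RequestRecord&amp;gt; requests, double thresholdMs) =&amp;gt;
        requests.Count(r =&amp;gt; r.DurationMs &amp;lt;= thresholdMs) / (double)requests.Count;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;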

&lt;h2&gt;
  
  
  Setting Realistic SLOs
&lt;/h2&gt;

&lt;p&gt;Here's a framework I use for setting SLOs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Measure Current Performance
&lt;/h3&gt;

&lt;p&gt;Don't guess. Run your system for 2-4 weeks and measure actual performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example query for availability over 30 days&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;request_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Understand User Expectations
&lt;/h3&gt;

&lt;p&gt;Interview stakeholders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What latency do users notice?&lt;/li&gt;
&lt;li&gt;How much downtime is acceptable?&lt;/li&gt;
&lt;li&gt;What's the business impact of degradation?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Set Achievable Targets
&lt;/h3&gt;

&lt;p&gt;If your current availability is 99.5%, don't set an SLO of 99.99%. Start with 99.7% and improve incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Your SLO should be slightly below your actual performance. This gives you room to experiment and deploy without constant alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Error Budgets
&lt;/h2&gt;

&lt;p&gt;Error budgets are the game-changer. They answer: "How much unreliability can we tolerate?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Calculating Error Budget
&lt;/h3&gt;

&lt;p&gt;For a 99.9% availability SLO over 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error Budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of downtime allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
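
&lt;p&gt;The same arithmetic generalises to any SLO and window; a tiny C# helper, for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Over a 30-day window:
//   0.999  allows 43.2 minutes of downtime
//   0.9995 allows 21.6 minutes
//   0.9999 allows  4.3 minutes
Console.WriteLine(ErrorBudget.AllowedDowntimeMinutes(slo: 0.999, windowDays: 30));

public static class ErrorBudget
{
    // Allowed downtime in minutes for a given availability SLO over a window.
    public static double AllowedDowntimeMinutes(double slo, int windowDays) =&amp;gt;
        (1 - slo) * windowDays * 24 * 60;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;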



&lt;h3&gt;
  
  
  Error Budget Policy
&lt;/h3&gt;

&lt;p&gt;Here's the policy we implemented:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Budget Remaining&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 50%&lt;/td&gt;
&lt;td&gt;Normal development velocity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-50%&lt;/td&gt;
&lt;td&gt;Increased review rigor, limit risky changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-25%&lt;/td&gt;
&lt;td&gt;Feature freeze, focus on reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 10%&lt;/td&gt;
&lt;td&gt;All hands on reliability, no new features&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
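
&lt;p&gt;We also encode the policy in code so the expected action is unambiguous mid-incident. A simplified C# sketch of that mapping, with thresholds mirroring the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public static class ErrorBudgetPolicy
{
    // budgetRemaining is the fraction of the monthly budget still unspent (0.0-1.0).
    public static string ActionFor(double budgetRemaining) =&amp;gt; budgetRemaining switch
    {
        &amp;gt; 0.50 =&amp;gt; "Normal development velocity",
        &amp;gt; 0.25 =&amp;gt; "Increased review rigor, limit risky changes",
        &amp;gt; 0.10 =&amp;gt; "Feature freeze, focus on reliability",
        _ =&amp;gt; "All hands on reliability, no new features",
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;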

&lt;h3&gt;
  
  
  Burn Rate Alerts
&lt;/h3&gt;

&lt;p&gt;Instead of alerting on instantaneous errors, alert on burn rate—how fast you're consuming your error budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alert for fast burn rate&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorBudgetBurn&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[1h]))&lt;/span&gt;
      &lt;span class="s"&gt;/ sum(rate(http_requests_total[1h]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;gt; (14.4 * 0.001)  # 14.4x burn rate = budget exhausted in 5 days&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
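
&lt;p&gt;The 14.4 multiplier falls out of the budget math: burn rate is the observed error rate divided by the budgeted rate, and window length divided by burn rate gives time to exhaustion. A small C# sketch of that arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// A 1.44% error rate against a 99.9% SLO is a 14.4x burn:
// 30 days / 14.4 = roughly 2.1 days until the monthly budget is gone.
Console.WriteLine(BurnRate.DaysToExhaustion(BurnRate.Of(0.0144, slo: 0.999), windowDays: 30));

public static class BurnRate
{
    // How many times faster than budgeted we are consuming errors.
    public static double Of(double observedErrorRate, double slo) =&amp;gt;
        observedErrorRate / (1 - slo);

    // Days until the budget is spent if the current burn rate continues.
    public static double DaysToExhaustion(double burnRate, int windowDays) =&amp;gt;
        windowDays / burnRate;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;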



&lt;h2&gt;
  
  
  Real-World Implementation: A Case Study
&lt;/h2&gt;

&lt;p&gt;At BitFlyer, we implemented SLOs for our trading API:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No formal SLOs&lt;/li&gt;
&lt;li&gt;Alerts on arbitrary thresholds&lt;/li&gt;
&lt;li&gt;Constant alert fatigue&lt;/li&gt;
&lt;li&gt;No clear prioritization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Week 1-2: Instrumentation&lt;/strong&gt;&lt;br&gt;
We added OpenTelemetry instrumentation to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request duration histograms&lt;/li&gt;
&lt;li&gt;Status code counters&lt;/li&gt;
&lt;li&gt;Dependency latencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 3-4: Baseline Measurement&lt;/strong&gt;&lt;br&gt;
Measured actual performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability: 99.89%&lt;/li&gt;
&lt;li&gt;P99 latency: 180ms&lt;/li&gt;
&lt;li&gt;Error rate: 0.08%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 5-6: SLO Definition&lt;/strong&gt;&lt;br&gt;
Set initial SLOs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability SLO: 99.9% (gives 43 min/month budget)&lt;/li&gt;
&lt;li&gt;Latency SLO: 99% of requests &amp;lt; 200ms&lt;/li&gt;
&lt;li&gt;Error rate SLO: &amp;lt; 0.1% server errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 7-8: Alerting Migration&lt;/strong&gt;&lt;br&gt;
Replaced 47 arbitrary alerts with 6 SLO-based alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 availability burn rate alerts (fast/slow)&lt;/li&gt;
&lt;li&gt;2 latency burn rate alerts (fast/slow)&lt;/li&gt;
&lt;li&gt;2 error rate burn rate alerts (fast/slow)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results After 3 Months
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Alert volume reduced by 73%&lt;/li&gt;
&lt;li&gt;MTTR improved by 45%&lt;/li&gt;
&lt;li&gt;Engineering velocity increased (fewer interruptions)&lt;/li&gt;
&lt;li&gt;Clear prioritization framework for incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SLO Perfection Syndrome
&lt;/h3&gt;

&lt;p&gt;Don't aim for 100% availability. It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practically impossible to achieve&lt;/li&gt;
&lt;li&gt;Prohibitively expensive to approach&lt;/li&gt;
&lt;li&gt;A brake on innovation: a zero error budget leaves no room to ship changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between 99.9% and 99.99% is a 10x cost increase for most systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Too Many SLOs
&lt;/h3&gt;

&lt;p&gt;Start with 3-5 SLOs per service. More creates confusion and alert fatigue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ignoring Dependencies
&lt;/h3&gt;

&lt;p&gt;Your service's SLO is bounded by your dependencies' SLOs. If your database has 99.9% availability, you cannot achieve 99.99% for your API.&lt;/p&gt;
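
&lt;p&gt;A quick way to see the ceiling: availabilities of serial dependencies multiply. A short C# illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Linq;

// An API that needs both its database (99.9%) and an auth service (99.95%)
// can offer at most 0.999 * 0.9995 = 99.85% availability, before its own failures.
double[] dependencyAvailabilities = { 0.999, 0.9995 };
double ceiling = dependencyAvailabilities.Aggregate(1.0, (acc, a) =&amp;gt; acc * a);
Console.WriteLine(ceiling.ToString("P3"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;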

&lt;h3&gt;
  
  
  4. Set and Forget
&lt;/h3&gt;

&lt;p&gt;Review SLOs quarterly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are they still relevant?&lt;/li&gt;
&lt;li&gt;Are they too tight (constant alerts) or too loose (not protecting users)?&lt;/li&gt;
&lt;li&gt;Has the business context changed?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tooling Recommendations
&lt;/h2&gt;

&lt;p&gt;For implementing SLOs, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Collection&lt;/strong&gt;: Prometheus, Datadog, or Azure Monitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Tracking&lt;/strong&gt;: Sloth, Google SLO Generator, or Datadog SLO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Budget Visualization&lt;/strong&gt;: Grafana dashboards, custom Datadog dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt;: PagerDuty, Opsgenie integrated with burn rate alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SLOs, SLIs, and error budgets aren't just metrics—they're a cultural shift toward data-driven reliability decisions. Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instrument your critical paths&lt;/li&gt;
&lt;li&gt;Measure for 2-4 weeks&lt;/li&gt;
&lt;li&gt;Set conservative SLOs&lt;/li&gt;
&lt;li&gt;Implement burn rate alerting&lt;/li&gt;
&lt;li&gt;Create an error budget policy&lt;/li&gt;
&lt;li&gt;Review and iterate quarterly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal isn't perfect reliability—it's appropriate reliability that balances user happiness with engineering velocity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about implementing SLOs? Connect with me on &lt;a href="https://linkedin.com/in/devang20" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or reach out via the contact form.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>observability</category>
      <category>reliability</category>
    </item>
    <item>
      <title>OpenTelemetry in Practice: Vendor-Agnostic Observability at Scale</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:15:43 +0000</pubDate>
      <link>https://dev.to/clouddevang/opentelemetry-in-practice-vendor-agnostic-observability-at-scale-4c4m</link>
      <guid>https://dev.to/clouddevang/opentelemetry-in-practice-vendor-agnostic-observability-at-scale-4c4m</guid>
      <description>&lt;p&gt;When we started redesigning our customer-facing platform, observability was a first-class concern. We had been using a mix of Azure Application Insights, custom logging, and ad-hoc metrics—a common pattern that leads to gaps in visibility and vendor lock-in. This time, we chose OpenTelemetry (OTel) as our observability foundation. Here's what we learned implementing it in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is a CNCF project that provides vendor-neutral APIs, SDKs, and tools for collecting telemetry data. The key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Flexibility&lt;/strong&gt;: Export to any backend (Datadog, Jaeger, Azure Monitor, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified API&lt;/strong&gt;: One SDK for traces, metrics, and logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industry Standard&lt;/strong&gt;: Growing ecosystem of instrumentation libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future-Proof&lt;/strong&gt;: Active community and broad industry adoption&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We chose Datadog as our initial backend, but the real value is flexibility. When costs or features change, we can switch backends without rewriting instrumentation code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars, Unified
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry handles three types of telemetry:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traces
&lt;/h3&gt;

&lt;p&gt;Distributed traces follow a request across service boundaries. Each span represents a unit of work with timing, attributes, and relationships to other spans.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Numerical measurements like request counts, latency percentiles, and business metrics. OTel supports counters, gauges, and histograms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;

&lt;p&gt;Structured log records with context. OTel logs include trace context, enabling correlation between logs and traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Architecture
&lt;/h2&gt;

&lt;p&gt;Our architecture uses the OTel Collector as a central aggregation point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Application] → [OTel SDK] → [OTel Collector] → [Datadog]
                                      ↓
                               [Azure Monitor] (backup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Collector provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buffering&lt;/strong&gt;: Handles backend unavailability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt;: Sampling, filtering, attribute manipulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-export&lt;/strong&gt;: Send to multiple backends simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SDK Configuration
&lt;/h3&gt;

&lt;p&gt;We use the .NET OpenTelemetry SDK. Here's our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOpenTelemetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ConfigureResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"payment-service"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAttributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetEnvironmentVariable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ASPNETCORE_ENVIRONMENT"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Assembly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetExecutingAssembly&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;GetName&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithTracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracing&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tracing&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAspNetCoreInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddHttpClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSqlClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAspNetCoreInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddHttpClientInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource attributes&lt;/strong&gt;: Service name, environment, and version tag every signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instrumentation&lt;/strong&gt;: ASP.NET Core, HttpClient, and SQL are instrumented automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom sources&lt;/strong&gt;: Our business logic emits additional spans and metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTLP export&lt;/strong&gt;: The OpenTelemetry Protocol is the native format for the Collector&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Custom Instrumentation
&lt;/h3&gt;

&lt;p&gt;Auto-instrumentation covers HTTP and database calls, but business logic needs manual spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentProcessor&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PaymentService"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateCounter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"payments.processed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;ProcessPayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Payment&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ActivitySource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;StartActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ProcessPayment"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"payment.amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"payment.currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Business logic&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ValidatePayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ExecutePayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;SetStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ActivityStatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;PaymentsProcessed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"failure"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A span for each payment with amount and currency attributes&lt;/li&gt;
&lt;li&gt;A counter metric with success/failure dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Structured Logging with Trace Context
&lt;/h3&gt;

&lt;p&gt;OTel logs aren't just text—they're structured records with trace context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogInformation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"Payment {PaymentId} processed for {Amount} {Currency}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Currency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OTel logging bridge automatically adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;trace_id&lt;/code&gt;: Links this log to the active trace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;span_id&lt;/code&gt;: Links to the specific span&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;severity&lt;/code&gt;: Derived from the log level&lt;/li&gt;
&lt;li&gt;Structured attributes from the message template&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Datadog, clicking on a log entry shows the full trace that generated it. No correlation IDs to manage manually.&lt;/p&gt;
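
&lt;p&gt;Wiring the bridge is a one-time host change. A minimal sketch for an ASP.NET Core host; the Collector endpoint matches the one used for traces and metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var builder = WebApplication.CreateBuilder(args);

builder.Logging.AddOpenTelemetry(logging =&amp;gt;
{
    // Keep the rendered message and scope values on each exported record.
    logging.IncludeFormattedMessage = true;
    logging.IncludeScopes = true;

    // Ship log records to the same Collector endpoint as traces and metrics.
    logging.AddOtlpExporter(options =&amp;gt;
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
    });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;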

&lt;h2&gt;
  
  
  Collector Configuration
&lt;/h2&gt;

&lt;p&gt;The OTel Collector is the heart of our observability pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="na"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;limit_mib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insert&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DD_API_KEY}&lt;/span&gt;
  &lt;span class="na"&gt;azuremonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;connection_string&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${AZURE_MONITOR_CONNECTION_STRING}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;azuremonitor&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batching&lt;/strong&gt;: Reduces network overhead by sending telemetry in batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory limiting&lt;/strong&gt;: Prevents collector OOM during traffic spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribute injection&lt;/strong&gt;: Adds consistent tags across all telemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-export&lt;/strong&gt;: Primary to Datadog, backup to Azure Monitor&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sampling is Essential
&lt;/h3&gt;

&lt;p&gt;At scale, 100% trace sampling is expensive. We use a combination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head-based sampling&lt;/strong&gt;: 10% of all traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tail-based sampling&lt;/strong&gt;: 100% of error traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority sampling&lt;/strong&gt;: 100% for critical paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Collector's tail sampling processor examines completed traces before deciding to keep them.&lt;/p&gt;
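
&lt;p&gt;The head-based piece lives in the SDK configuration, while the tail-based rules live in the Collector's &lt;code&gt;tail_sampling&lt;/code&gt; processor. A minimal sketch of the 10% head sampler in the same .NET setup shown earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;services.AddOpenTelemetry()
    .WithTracing(tracing =&amp;gt; tracing
        // Head-based: sample 10% of new traces, but always honour the parent's
        // decision so a distributed trace is never half-kept across services.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;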

&lt;h3&gt;
  
  
  Cardinality Matters
&lt;/h3&gt;

&lt;p&gt;High-cardinality attributes (user IDs, request IDs) on metrics cause a combinatorial explosion of time series, and storage costs grow with them. We learned to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use high-cardinality attributes only on traces&lt;/li&gt;
&lt;li&gt;Keep metric dimensions bounded (status codes, service names, regions)&lt;/li&gt;
&lt;li&gt;Use exemplars to link metrics to representative traces&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Context Propagation is Tricky
&lt;/h3&gt;

&lt;p&gt;Traces only work if context propagates correctly. We encountered issues with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async boundaries&lt;/strong&gt;: Ensure activity context flows to background tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message queues&lt;/strong&gt;: Propagate trace context in message headers (see the sketch below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language services&lt;/strong&gt;: Use W3C Trace Context format for compatibility&lt;/li&gt;
&lt;/ul&gt;
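
&lt;p&gt;For the message-queue case, the W3C trace context has to travel inside the message itself. A minimal C# sketch of injecting it into a header dictionary on the publish side; the consumer extracts it with the same propagator, and the header shape is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Context.Propagation;

public static class MessagePublisher
{
    // Copy the current trace context (and baggage) into message headers so the
    // consumer can continue the same distributed trace.
    public static void InjectTraceContext(IDictionary&amp;lt;string, string&amp;gt; headers)
    {
        var context = new PropagationContext(
            Activity.Current?.Context ?? default,
            Baggage.Current);

        Propagators.DefaultTextMapPropagator.Inject(
            context,
            headers,
            (carrier, key, value) =&amp;gt; carrier[key] = value);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;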

&lt;h3&gt;
  
  
  Start with Auto-Instrumentation
&lt;/h3&gt;

&lt;p&gt;Don't try to instrument everything manually. Start with auto-instrumentation libraries for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP servers and clients&lt;/li&gt;
&lt;li&gt;Database clients&lt;/li&gt;
&lt;li&gt;Message queue clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add custom instrumentation incrementally for business-specific visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After implementing OpenTelemetry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean time to detection&lt;/strong&gt;: Reduced by 50% with correlated traces and logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-service debugging&lt;/strong&gt;: Single trace view shows entire request flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend flexibility&lt;/strong&gt;: Successfully tested migration to alternative backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility&lt;/strong&gt;: Metrics show resource consumption per feature&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most valuable outcome: when incidents occur, engineers start with a trace, not a sea of logs. Root cause identification that used to take hours now takes minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry requires upfront investment—SDK configuration, Collector deployment, team education. But the payoff is substantial: unified observability that's not locked to any vendor.&lt;/p&gt;

&lt;p&gt;If you're starting fresh, OpenTelemetry is the clear choice. If you're migrating from a proprietary solution, start with new services and gradually expand. The ecosystem is mature enough for production use, and the community is only growing.&lt;/p&gt;

&lt;p&gt;The future of observability is open standards. OpenTelemetry is that standard.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>otel</category>
      <category>datadog</category>
    </item>
    <item>
      <title>Migrating from Community ingress-nginx to F5 NGINX Ingress Controller Across 3 AKS Clusters</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:09:34 +0000</pubDate>
      <link>https://dev.to/clouddevang/migrating-from-community-ingress-nginx-to-f5-nginx-ingress-controller-across-3-aks-clusters-5g2h</link>
      <guid>https://dev.to/clouddevang/migrating-from-community-ingress-nginx-to-f5-nginx-ingress-controller-across-3-aks-clusters-5g2h</guid>
      <description>&lt;p&gt;Earlier this month I migrated three production AKS clusters off the community &lt;code&gt;ingress-nginx&lt;/code&gt; controller and onto the F5 NGINX Ingress Controller OSS (v2.5.1). The three workloads were a compliance API service, a real-time WebSocket trading server, and a charting frontend. Same controller name, completely different internals — and enough sharp edges to fill a post.&lt;/p&gt;

&lt;p&gt;This is the full account: what changed, what broke, and the patterns I standardised across all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Migrate
&lt;/h2&gt;

&lt;p&gt;The community Helm chart (&lt;code&gt;kubernetes/ingress-nginx&lt;/code&gt;) and the F5 chart (&lt;code&gt;nginx-stable/nginx-ingress&lt;/code&gt;) both proxy traffic through NGINX, but they diverge at almost every other layer — Helm structure, annotation prefixes, config key names, metrics port, and label selectors. F5 NGINX IC is the upstream-maintained version aligned with NGINX OSS releases and gives tighter control over the NGINX config without relying on the community's annotation translation layer.&lt;/p&gt;

&lt;p&gt;The practical trigger was a mix of factors: the community chart had accumulated workarounds for bugs we no longer needed, the annotation surface was getting hard to audit, and we wanted a single, consistent ingress stack across clusters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Stayed the Same
&lt;/h2&gt;

&lt;p&gt;Before diving into the diffs, here is what did &lt;strong&gt;not&lt;/strong&gt; change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IngressClass name remains &lt;code&gt;nginx&lt;/code&gt; in every cluster (no application-level changes needed)&lt;/li&gt;
&lt;li&gt;Azure Load Balancer type (internal where it was internal, public where public)&lt;/li&gt;
&lt;li&gt;cert-manager ClusterIssuers (one field rename, covered below)&lt;/li&gt;
&lt;li&gt;Linkerd injection on controller pods&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Migration Playbook
&lt;/h2&gt;

&lt;p&gt;Every cluster followed the same five-step pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Pull the F5 chart via OCI — no helm repo add needed&lt;/span&gt;
helm pull oci://ghcr.io/nginx/charts/nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 2.5.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--destination&lt;/span&gt; /tmp/charts/

&lt;span class="c"&gt;# 2. Verify checksum before touching anything&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"23c866c0531719586570435a4d9a57ac0fb9661fdafd572c8916208cb7b4f225  /tmp/charts/nginx-ingress-2.5.1.tgz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;--check&lt;/span&gt;

&lt;span class="c"&gt;# 3. One-time IngressClass migration guard&lt;/span&gt;
&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get ingressclass nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.controller}'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"k8s.io/ingress-nginx"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removing community IngressClass — allowing F5 takeover"&lt;/span&gt;
  kubectl delete ingressclass nginx
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 4. Helm upgrade&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; nginx-ingress /tmp/charts/nginx-ingress-2.5.1.tgz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 5m

&lt;span class="c"&gt;# 5. Verify the right controller is running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;nginx-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; nginx-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3 deserves its own section.&lt;/p&gt;




&lt;h2&gt;
  
  
  The IngressClass Immutability Trap
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;spec.controller&lt;/code&gt; on an IngressClass resource is &lt;strong&gt;immutable&lt;/strong&gt; after creation. The community controller sets it to &lt;code&gt;k8s.io/ingress-nginx&lt;/code&gt;; the F5 controller expects &lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;. If you just run &lt;code&gt;helm upgrade&lt;/code&gt;, F5 will fail to adopt the existing IngressClass and create a conflicting one — or worse, silently ignore it and not process any Ingress resources.&lt;/p&gt;

&lt;p&gt;The solution is to delete the IngressClass before the first F5 install. But a naive unconditional delete is dangerous in an idempotent pipeline — if someone reruns the pipeline after migration, they'd delete the already-correct F5-owned IngressClass mid-flight, causing a brief outage.&lt;/p&gt;

&lt;p&gt;The guard condition solves this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTROLLER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"k8s.io/ingress-nginx"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;kubectl delete ingressclass nginx
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first successful F5 install, &lt;code&gt;spec.controller&lt;/code&gt; reads &lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;, so every subsequent pipeline run skips the delete. One-time, idempotent, safe.&lt;/p&gt;
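
&lt;p&gt;To confirm the takeover, and that future pipeline runs will skip the delete, read the field back; for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# After the first F5 install this should print the F5 controller string,
# which is exactly the condition that makes the step-3 guard a no-op on reruns.
kubectl get ingressclass nginx -o jsonpath='{.spec.controller}'
# nginx.org/ingress-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;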




&lt;h2&gt;
  
  
  Helm Values: Structural Differences
&lt;/h2&gt;

&lt;p&gt;The community chart uses a flat &lt;code&gt;controller.config&lt;/code&gt; map. F5 nests everything under &lt;code&gt;controller.config.entries&lt;/code&gt;. Small diff, big gotcha if you copy-paste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;load-balance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ewma"&lt;/span&gt;
    &lt;span class="na"&gt;use-gzip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;F5:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600s"&lt;/span&gt;   &lt;span class="c1"&gt;# note: F5 expects the unit suffix&lt;/span&gt;
      &lt;span class="na"&gt;lb-method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ewma"&lt;/span&gt;            &lt;span class="c1"&gt;# key renamed&lt;/span&gt;
      &lt;span class="c1"&gt;# use-gzip has no equivalent — moved to http-snippets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A number of community config keys simply do not exist in F5 and are silently ignored if you leave them in. I audited every key against the &lt;a href="https://docs.nginx.com/nginx-ingress-controller/" rel="noopener noreferrer"&gt;F5 config documentation&lt;/a&gt; and removed: &lt;code&gt;allow-snippet-annotations&lt;/code&gt;, &lt;code&gt;allow-backend-server-header&lt;/code&gt;, &lt;code&gt;block-user-agents&lt;/code&gt;, &lt;code&gt;enable-vts-status&lt;/code&gt;, &lt;code&gt;generate-request-id&lt;/code&gt;, &lt;code&gt;limit-req-status-code&lt;/code&gt;, &lt;code&gt;use-forwarded-headers&lt;/code&gt;, &lt;code&gt;use-geoip&lt;/code&gt;, &lt;code&gt;upstream-keepalive-*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Other keys that F5 &lt;strong&gt;does&lt;/strong&gt; support but with different names:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community key&lt;/th&gt;
&lt;th&gt;F5 equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load-balance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;lb-method&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;proxy-read-timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;proxy-read-timeout&lt;/code&gt; + unit suffix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;client-header-timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move to &lt;code&gt;http-snippets&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
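
&lt;p&gt;For the keys that move to &lt;code&gt;http-snippets&lt;/code&gt;, the raw NGINX directive goes into the shared HTTP block. A minimal sketch for &lt;code&gt;client-header-timeout&lt;/code&gt;, written as a small values overlay (the 30s value and the file name are illustrative, not the production settings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Overlay that carries the old client-header-timeout over as a raw directive;
# pass it alongside the main values file on helm upgrade with an extra -f flag.
cat &lt;&lt;'EOF' &gt; values-http-snippets.yaml
controller:
  config:
    entries:
      http-snippets: |
        client_header_timeout 30s;
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
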

&lt;p&gt;The full base controller config across all three clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment&lt;/span&gt;
  &lt;span class="na"&gt;enableCustomResources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;      &lt;span class="c1"&gt;# not using VirtualServer CRDs&lt;/span&gt;
  &lt;span class="na"&gt;enableSnippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;telemetryReporting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;                   &lt;span class="c1"&gt;# no outbound access to oss.edge.df.f5.com&lt;/span&gt;

  &lt;span class="na"&gt;ingressClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;setAsDefaultIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;

  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9113&lt;/span&gt;                      &lt;span class="c1"&gt;# changed from community's default&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three settings here tripped things up before I got them right:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;telemetryReporting.enable: false&lt;/code&gt;&lt;/strong&gt; — F5 attempts to phone home to &lt;code&gt;oss.edge.df.f5.com&lt;/code&gt;. In a cluster with no outbound internet on the node pool, this causes the controller pod to crash-loop on startup waiting for the connection to time out. Must be disabled explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;enableCustomResources: false&lt;/code&gt;&lt;/strong&gt; — F5 ships its own CRDs (VirtualServer, TransportServer, Policy). If you leave this enabled and those CRDs aren't pre-installed, the controller crashes. Since all three clusters use standard Kubernetes Ingress resources, I disabled them entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure LB health probe&lt;/strong&gt; — The community controller serves &lt;code&gt;/healthz&lt;/code&gt; on port 80. F5 does not. Azure's default HTTP probe on that path will mark all backends unhealthy, so switch the probe to TCP, which is what the &lt;code&gt;azure-load-balancer-health-probe-protocol: tcp&lt;/code&gt; service annotation above does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rate Limiting: From Annotations to NGINX Snippets
&lt;/h2&gt;

&lt;p&gt;Community ingress-nginx ships first-class annotations for rate limiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# community — applied as ingress annotations&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-req-rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;120r/m"&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-conn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
&lt;span class="na"&gt;nginx.ingress.kubernetes.io/limit-req-status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;429"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 NGINX IC does not have equivalent annotation primitives. The correct F5 approach is to declare the rate limit zones globally in &lt;code&gt;http-snippets&lt;/code&gt; (controller values) and apply them per-ingress via &lt;code&gt;server-snippets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controller values — shared zones:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http-snippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;geo $app_limit_bypass {&lt;/span&gt;
          &lt;span class="s"&gt;default 0;&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt;office-cidr-1&amp;gt; 1;&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt;office-cidr-2&amp;gt; 1;&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;map $app_limit_bypass $app_limit_key {&lt;/span&gt;
          &lt;span class="s"&gt;0 $binary_remote_addr;&lt;/span&gt;
          &lt;span class="s"&gt;1 "";&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;limit_req_zone  $app_limit_key zone=app_rpm:10m rate=120r/m;&lt;/span&gt;
        &lt;span class="s"&gt;limit_conn_zone $app_limit_key zone=app_conn:10m;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingress manifest — apply per route:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nginx.org/server-snippets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;limit_req zone=app_rpm burst=80 nodelay;&lt;/span&gt;
    &lt;span class="s"&gt;limit_req_status 429;&lt;/span&gt;
    &lt;span class="s"&gt;limit_conn app_conn 60;&lt;/span&gt;
    &lt;span class="s"&gt;limit_conn_status 429;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The geo+map pattern lets specific IP ranges (office networks, CI runners, load testing hosts) bypass rate limits by mapping to an empty key — which &lt;code&gt;limit_req_zone&lt;/code&gt; treats as unlimited. This is cleaner than maintaining allow-lists in multiple annotation blocks across ingress manifests.&lt;/p&gt;
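
&lt;p&gt;To confirm the zones actually bite after cutover, a quick smoke test from a non-bypassed address is enough. A sketch (the hostname, endpoint, and request count are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hammer one endpoint and tally the status codes: a healthy setup shows some
# 200s (within 120r/m plus burst=80) followed by 429s once the bucket empties.
for i in $(seq 1 250); do
  curl -s -o /dev/null -w "%{http_code}\n" https://app.example.com/api/ping
done | sort | uniq -c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;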




&lt;h2&gt;
  
  
  WebSocket Service: Keepalive Surprises
&lt;/h2&gt;

&lt;p&gt;One of the services is a Socket.io server behind WebSocket connections. Everything looked healthy post-migration — pods up, ingress adopted — but Socket.io clients started disconnecting every 30–60 seconds.&lt;/p&gt;

&lt;p&gt;The root cause: F5's default &lt;code&gt;keepalive-timeout&lt;/code&gt; is &lt;code&gt;0s&lt;/code&gt; (disabled), whereas the community chart defaults to &lt;code&gt;60s&lt;/code&gt;. Long-lived WebSocket connections proxied through NGINX rely on that keepalive window to survive idle periods, and with it disabled, NGINX was closing idle connections server-side.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;keepalive-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60s"&lt;/span&gt;
      &lt;span class="na"&gt;http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP/2 and WebSocket upgrades conflict; disable explicitly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also required adding the F5 WebSocket annotation to the ingress manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nginx.org/websocket-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-websocket-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this annotation, F5 does not set the necessary &lt;code&gt;Upgrade&lt;/code&gt; and &lt;code&gt;Connection&lt;/code&gt; proxy headers for WebSocket handshakes. The community controller handled this automatically; F5 requires you to be explicit.&lt;/p&gt;
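
&lt;p&gt;A quick way to confirm the annotation took effect is to dump the rendered NGINX config from the controller pod and look for the upgrade headers; a sketch using the same label selector as the playbook above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Grab one controller pod and dump the full rendered config with nginx -T,
# then check that the WebSocket upgrade headers are present for the service.
POD=$(kubectl get pods -n nginx-ingress -l app.kubernetes.io/name=nginx-ingress \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n nginx-ingress "$POD" -- nginx -T 2&gt;/dev/null | grep -i -A1 'proxy_set_header Upgrade'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;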




&lt;h2&gt;
  
  
  Zero-Downtime Service Selector Patch
&lt;/h2&gt;

&lt;p&gt;One cluster runs a secondary Service that routes specific traffic, and its label selector was hardcoded to the community controller labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;app.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ingress-nginx&lt;/span&gt;
&lt;span class="err"&gt;app.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;component&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 uses &lt;code&gt;app.kubernetes.io/name=nginx-ingress&lt;/code&gt;. After migration, the service selector matched nothing — endpoints went empty, traffic dropped.&lt;/p&gt;

&lt;p&gt;Re-applying the existing Service manifest won't fix this, because that manifest still carries the stale selector. Instead, I patched the selector as a pre-upgrade pipeline step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl patch service &amp;lt;legacy-service-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{
    "spec": {
      "selector": {
        "app.kubernetes.io/name": "nginx-ingress"
      }
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--type='merge'&lt;/code&gt; strategy replaces only the specified keys, leaving the rest of the selector intact. Running this before &lt;code&gt;helm upgrade&lt;/code&gt; means the service selector matches the new pods the moment they come up.&lt;/p&gt;

&lt;p&gt;The broader lesson: grep for &lt;code&gt;ingress-nginx&lt;/code&gt; in &lt;strong&gt;all&lt;/strong&gt; Service selectors across your cluster before starting the migration. Any service with a hardcoded community label selector will silently drop traffic after cutover.&lt;/p&gt;
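
&lt;p&gt;A sketch of that grep, done against the live cluster rather than the manifests (assumes &lt;code&gt;jq&lt;/code&gt; is available):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List every Service whose selector values still mention the community labels;
# anything printed here needs a selector patch before cutover.
kubectl get svc --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.selector != null)
      | select([.spec.selector[]] | any(contains("ingress-nginx")))
      | "\(.metadata.namespace)/\(.metadata.name)"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;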




&lt;h2&gt;
  
  
  cert-manager
&lt;/h2&gt;

&lt;p&gt;One field rename in the ClusterIssuer template — &lt;code&gt;class&lt;/code&gt; is deprecated in favour of &lt;code&gt;ingressClassName&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# before&lt;/span&gt;
&lt;span class="na"&gt;solvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;

&lt;span class="c1"&gt;# after&lt;/span&gt;
&lt;span class="na"&gt;solvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also removed a cert-manager feature gate that was only needed to work around a community ingress-nginx bug (issue #11176) related to path type handling. F5 does not have the bug:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# removed from cert-manager values&lt;/span&gt;
&lt;span class="na"&gt;featureGates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACMEHTTP01IngressPathTypeExact=false"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Datadog Metrics
&lt;/h2&gt;

&lt;p&gt;F5 exposes Prometheus metrics on port &lt;code&gt;9113&lt;/code&gt; (the community controller used &lt;code&gt;8080&lt;/code&gt;). The existing Datadog auto-discovery config was pointing at the wrong port. I added an OpenMetrics check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# datadog-agent values.yaml&lt;/span&gt;
&lt;span class="na"&gt;confd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openmetrics.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;ad_identifiers:&lt;/span&gt;
      &lt;span class="s"&gt;- nginx-ingress&lt;/span&gt;
    &lt;span class="s"&gt;init_config:&lt;/span&gt;
    &lt;span class="s"&gt;instances:&lt;/span&gt;
      &lt;span class="s"&gt;- openmetrics_endpoint: "http://%%host%%:9113/metrics"&lt;/span&gt;
        &lt;span class="s"&gt;namespace: nginx_ingress&lt;/span&gt;
        &lt;span class="s"&gt;metrics:&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_accepted&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_active&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_connections_handled&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_http_requests_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_ingress_resources_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_reloads_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_reload_errors_total&lt;/span&gt;
          &lt;span class="s"&gt;- nginx_ingress_controller_nginx_last_reload_milliseconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to watch: the file must be named &lt;code&gt;openmetrics.yaml&lt;/code&gt; (not &lt;code&gt;nginx-ingress.yaml&lt;/code&gt;) for Datadog's catalog to recognise it, and &lt;code&gt;ad_identifiers&lt;/code&gt; must match the container name &lt;code&gt;nginx-ingress&lt;/code&gt; exactly.&lt;/p&gt;
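
&lt;p&gt;To check the pickup, exec into one of the node agents and inspect its status output. The namespace, label, and container name below assume a stock datadog-agent Helm install, so adjust them to your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Find one agent pod and confirm the openmetrics check is running and
# scraping the controller on port 9113.
AGENT=$(kubectl get pods -n datadog -l app=datadog -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n datadog "$AGENT" -c agent -- agent status | grep -i -A8 openmetrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;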




&lt;h2&gt;
  
  
  Node Selector Key Update
&lt;/h2&gt;

&lt;p&gt;The community chart uses the deprecated node label key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;beta.kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;os&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;F5 values use the stable GA key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;kubernetes.io/&lt;/span&gt;&lt;span class="py"&gt;os&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Newer AKS node images no longer carry &lt;code&gt;beta.kubernetes.io/os&lt;/code&gt;. If your node pool has dropped it, community controller pods won't schedule. Not migration-specific, but worth cleaning up in the same PR.&lt;/p&gt;
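
&lt;p&gt;A one-liner makes the drift visible per node; an empty column under the deprecated key means the old selector would no longer match that node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show both the deprecated beta key and the GA key side by side for every node.
kubectl get nodes -L beta.kubernetes.io/os -L kubernetes.io/os
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;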




&lt;h2&gt;
  
  
  Helm Upgrade Stability
&lt;/h2&gt;

&lt;p&gt;On cold nodes (a newly scaled-up node pool), the F5 controller image pull can take longer than the three minutes the pipeline previously allowed for the release. An explicit &lt;code&gt;--wait --timeout 5m&lt;/code&gt; prevents spurious failures that looked like deployment regressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; nginx-ingress ./nginx-ingress-2.5.1.tgz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; nginx-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 5m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Rollout Issues Timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;F5 crash-loops on startup&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;telemetryReporting.enable: false&lt;/code&gt; + &lt;code&gt;enableCustomResources: false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Linkerd not injecting controller pods&lt;/td&gt;
&lt;td&gt;Fixed annotation path: &lt;code&gt;podAnnotations&lt;/code&gt; → &lt;code&gt;controller.pod.annotations&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Datadog scraping wrong port&lt;/td&gt;
&lt;td&gt;Added OpenMetrics check on port 9113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+0&lt;/td&gt;
&lt;td&gt;Datadog system-probe seccomp failures&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;systemProbe.enabled: false&lt;/code&gt;, &lt;code&gt;discovery.enabled: false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+1h&lt;/td&gt;
&lt;td&gt;All LB backends unhealthy&lt;/td&gt;
&lt;td&gt;Switched Azure LB probe from HTTP &lt;code&gt;/healthz&lt;/code&gt; to TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+2h&lt;/td&gt;
&lt;td&gt;Socket.io client disconnections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;keepalive-timeout: 60s&lt;/code&gt;, &lt;code&gt;nginx.org/websocket-services&lt;/code&gt; annotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+3h&lt;/td&gt;
&lt;td&gt;Secondary service endpoints empty&lt;/td&gt;
&lt;td&gt;Pre-upgrade service selector patch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+24h&lt;/td&gt;
&lt;td&gt;Helm timeout on cold nodes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--wait --timeout 5m&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+10d&lt;/td&gt;
&lt;td&gt;IngressClass delete too aggressive in pipeline reruns&lt;/td&gt;
&lt;td&gt;Made delete conditional on &lt;code&gt;spec.controller&lt;/code&gt; value&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The conditional IngressClass delete came last because the unconditional delete worked fine on the first run — the rerun risk only became apparent during a pipeline review afterward.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Differences Cheat Sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Community ingress-nginx&lt;/th&gt;
&lt;th&gt;F5 NGINX IC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Helm source&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes.github.io/ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCI: &lt;code&gt;ghcr.io/nginx/charts/nginx-ingress&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chart name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx-ingress&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config structure&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;controller.config&lt;/code&gt; flat map&lt;/td&gt;
&lt;td&gt;&lt;code&gt;controller.config.entries&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Annotations (&lt;code&gt;nginx.ingress.kubernetes.io/*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;http-snippets&lt;/code&gt; + &lt;code&gt;server-snippets&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nginx.org/websocket-services&lt;/code&gt; required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics port&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;td&gt;9113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod labels&lt;/td&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/name=ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/name=nginx-ingress&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IngressClass controller field&lt;/td&gt;
&lt;td&gt;&lt;code&gt;k8s.io/ingress-nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx.org/ingress-controller&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linkerd annotation path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;podAnnotations&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;controller.pod.annotations&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node selector key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;beta.kubernetes.io/os&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kubernetes.io/os&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telemetry&lt;/td&gt;
&lt;td&gt;Off by default&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Must disable explicitly&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom resources&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Must disable if not using&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LB health probe&lt;/td&gt;
&lt;td&gt;HTTP &lt;code&gt;/healthz&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;TCP only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Audit every config key before migrating.&lt;/strong&gt; F5 silently ignores unknown config keys. A pre-migration diff against the F5 config reference would have caught the &lt;code&gt;upstream-keepalive-*&lt;/code&gt; and &lt;code&gt;use-gzip&lt;/code&gt; removals before they hit production.&lt;/p&gt;
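
&lt;p&gt;The audit itself can be mostly mechanical: dump the keys currently set and walk the list against the F5 ConfigMap reference. A sketch, assuming mikefarah &lt;code&gt;yq&lt;/code&gt; v4 (the file name is a placeholder for the old community values file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print every key set under controller.config in the community values file,
# then check each one for an F5 equivalent, a rename, or an http-snippets move.
yq '.controller.config | keys' community-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;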

&lt;p&gt;&lt;strong&gt;Test WebSocket apps on a staging cluster first.&lt;/strong&gt; The keepalive timeout issue was predictable — the default changed between controllers and I didn't check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grep for &lt;code&gt;ingress-nginx&lt;/code&gt; in all Service selectors before starting.&lt;/strong&gt; Any hardcoded community label selector silently drops traffic after cutover. Add the selector patch to your playbook as a standard pre-upgrade step, not a reactive fix.&lt;/p&gt;




&lt;p&gt;The migration is complete and stable across all three clusters. Ingress configurations are now easier to reason about — NGINX config is NGINX config, not a translation layer of annotations into &lt;code&gt;nginx.conf&lt;/code&gt; directives you can't see. If you're running the community chart and considering the switch, the above should give you a realistic picture of what to budget for.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>aks</category>
      <category>devops</category>
    </item>
    <item>
      <title>KEDA vs Azure Functions: Choosing the Right Autoscaler for Bursty Workloads</title>
      <dc:creator>Devang Goyal</dc:creator>
      <pubDate>Sat, 16 May 2026 10:09:34 +0000</pubDate>
      <link>https://dev.to/clouddevang/keda-vs-azure-functions-choosing-the-right-autoscaler-for-bursty-workloads-249i</link>
      <guid>https://dev.to/clouddevang/keda-vs-azure-functions-choosing-the-right-autoscaler-for-bursty-workloads-249i</guid>
      <description>&lt;p&gt;When we needed to process millions of events from Azure Service Bus, the obvious choice seemed to be Azure Functions. Serverless, event-driven, automatic scaling—what's not to love? But after months of production experience, we migrated to Azure Container Apps with KEDA. Here's why, and when you might want to make the same choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case: Bursty Event Processing
&lt;/h2&gt;

&lt;p&gt;Our system processed financial transactions from a message queue. The traffic pattern was extremely bursty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Off-peak&lt;/strong&gt;: 10-50 messages per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak&lt;/strong&gt;: 10,000+ messages per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ramp time&lt;/strong&gt;: Bursts arrive within seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Azure Functions' scale controller is designed for this pattern. It monitors queue depth and scales out workers automatically. In theory, perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems We Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cold Start Latency
&lt;/h3&gt;

&lt;p&gt;Azure Functions (Consumption plan) exhibited cold start times of 5-10 seconds for our .NET 6 application. During sudden bursts, the queue would accumulate thousands of messages before enough instances were warm.&lt;/p&gt;

&lt;p&gt;We tried the Premium plan, which keeps pre-warmed instances ready. This helped, but at significant cost—we were paying for idle compute 24/7.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scaling Granularity
&lt;/h3&gt;

&lt;p&gt;The Azure Functions scale controller makes decisions based on aggregate metrics. For Service Bus, it examines message count and age. But the scaling algorithm is opaque, and we had limited control over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale-out threshold&lt;/strong&gt;: How many messages trigger a new instance?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-in behavior&lt;/strong&gt;: How quickly do instances terminate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum instances&lt;/strong&gt;: Hard limits that required support tickets to raise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed finer control to optimize for our specific latency requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Instance Limits
&lt;/h3&gt;

&lt;p&gt;Our function sometimes needed 50+ concurrent instances to process bursts. Azure Functions has per-app limits that required special configuration. More importantly, rapid scaling caused resource contention in the underlying infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter KEDA on Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;KEDA (Kubernetes Event-driven Autoscaling) provides the same event-driven scaling but with explicit, configurable rules. Azure Container Apps integrates KEDA natively, giving us serverless simplicity with Kubernetes-level control.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Migration
&lt;/h3&gt;

&lt;p&gt;Moving from Azure Functions to Container Apps required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Containerizing the application&lt;/strong&gt;: Our function code became a container image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring KEDA scalers&lt;/strong&gt;: Explicit rules for Service Bus scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setting up Container Apps&lt;/strong&gt;: Managed Kubernetes without the management overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's our KEDA configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-bus-scaler&lt;/span&gt;
      &lt;span class="na"&gt;custom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-servicebus&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;queueName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transactions&lt;/span&gt;
          &lt;span class="na"&gt;messageCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50"&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;our-namespace&lt;/span&gt;
        &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;servicebus-connection&lt;/span&gt;
            &lt;span class="na"&gt;triggerParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key differences from Azure Functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit message threshold&lt;/strong&gt;: &lt;code&gt;messageCount: "50"&lt;/code&gt; targets roughly 50 queued messages per replica, and the value is fully ours to tune&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum replicas&lt;/strong&gt;: Always keep 2 instances warm (no cold starts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum replicas&lt;/strong&gt;: Set exactly what we need, no support tickets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;p&gt;We ran identical workloads on both platforms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Azure Functions&lt;/th&gt;
&lt;th&gt;Container Apps + KEDA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold start (p95)&lt;/td&gt;
&lt;td&gt;8.2 seconds&lt;/td&gt;
&lt;td&gt;0 (always warm)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale-out time&lt;/td&gt;
&lt;td&gt;15-30 seconds&lt;/td&gt;
&lt;td&gt;5-10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost (monthly)&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;td&gt;$1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max throughput&lt;/td&gt;
&lt;td&gt;8,000 msg/sec&lt;/td&gt;
&lt;td&gt;15,000 msg/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost reduction came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More efficient bin-packing of containers&lt;/li&gt;
&lt;li&gt;No Premium plan pre-warm charges&lt;/li&gt;
&lt;li&gt;Faster scale-down during quiet periods&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Azure Functions
&lt;/h2&gt;

&lt;p&gt;Azure Functions still wins for certain scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Simple HTTP APIs
&lt;/h3&gt;

&lt;p&gt;For low-traffic APIs with occasional spikes, the Consumption plan's pay-per-execution model is unbeatable. Cold starts matter less for APIs where latency is measured in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Timer-Triggered Jobs
&lt;/h3&gt;

&lt;p&gt;Scheduled tasks that run once per hour don't need warm instances. Azure Functions' timer trigger is simpler to configure than a CronJob equivalent.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rapid Prototyping
&lt;/h3&gt;

&lt;p&gt;When you need to deploy something quickly, Azure Functions' binding system is incredibly productive. Input/output bindings for Blob Storage, Cosmos DB, and Service Bus require minimal code.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Teams Without Container Experience
&lt;/h3&gt;

&lt;p&gt;Not every team has container expertise. Azure Functions abstracts away the infrastructure entirely, which is valuable for teams focused on business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose KEDA + Container Apps
&lt;/h2&gt;

&lt;p&gt;Choose Container Apps with KEDA when:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. You Need Predictable Cold Starts
&lt;/h3&gt;

&lt;p&gt;If your SLA requires sub-second latency, keeping minimum replicas warm is essential. KEDA makes this configuration explicit.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Have Complex Scaling Requirements
&lt;/h3&gt;

&lt;p&gt;Multiple triggers, custom metrics, or specific threshold values require KEDA's flexibility. The scaling rules are transparent and version-controlled.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Your Workload is Container-Native
&lt;/h3&gt;

&lt;p&gt;If you're already building containers for other environments (local development, other clouds), Container Apps provides consistency without Kubernetes complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost Optimization Matters
&lt;/h3&gt;

&lt;p&gt;For high-volume workloads, Container Apps' consumption-based billing often works out cheaper than Functions Premium. Run the numbers for your specific usage pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Approaches
&lt;/h2&gt;

&lt;p&gt;We actually use both in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Functions&lt;/strong&gt;: Internal tools, scheduled jobs, low-traffic APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Apps + KEDA&lt;/strong&gt;: High-volume event processing, latency-sensitive workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platforms aren't mutually exclusive. Choose based on the specific requirements of each workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Tips
&lt;/h2&gt;

&lt;p&gt;If you're migrating from Functions to Container Apps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with KEDA documentation&lt;/strong&gt;: Understanding the scalers is crucial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scaling behavior&lt;/strong&gt;: Use load testing to verify your configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor scale events&lt;/strong&gt;: Azure Monitor shows container instance counts over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set alerts on queue depth&lt;/strong&gt;: Catch scaling issues before they become outages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For Service Bus specifically, configure dead-letter queue monitoring. KEDA scales based on active messages, not dead letters.&lt;/p&gt;
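
&lt;p&gt;One lightweight way to watch dead letters is the Azure CLI; the queue and namespace names below match the scaler config above, while the resource group is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Dead-lettered messages don't drive KEDA scaling, so track them separately
# and alert when the count starts climbing.
az servicebus queue show \
  --resource-group my-rg \
  --namespace-name our-namespace \
  --name transactions \
  --query countDetails.deadLetterMessageCount \
  --output tsv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;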

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure Functions and KEDA solve similar problems with different tradeoffs. Functions optimizes for simplicity; KEDA optimizes for control. Neither is universally better.&lt;/p&gt;

&lt;p&gt;For our bursty, latency-sensitive workload, KEDA's explicit configuration and warm instance support delivered better performance at lower cost. Your workload might be different.&lt;/p&gt;

&lt;p&gt;The best approach? Prototype both. Azure makes it easy to try Container Apps alongside Functions. Let the metrics guide your decision.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>azure</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
