Falolu Olaitan

Posted on Jun 12

When Kubernetes Isn't the Problem: Debugging External Dependencies in AKS

#redis #sql #azure #kubernetes

Introduction

One of the most common mistakes during incident response is assuming Kubernetes is the problem simply because the application runs on Kubernetes.

A customer reports an outage.

Requests start timing out.

Applications return errors.

Transactions begin failing.

The first reaction is usually:

AKS is down.

In many cases, AKS is completely healthy.

The nodes are healthy.

The pods are healthy.

The ingress controller is healthy.

The cluster is operating exactly as designed.

The actual problem exists somewhere outside Kubernetes.

After working through enough production incidents, I have learned that many application outages blamed on Kubernetes are eventually traced back to dependencies such as:

Databases
Redis
DNS
Key Vault
Private Endpoints
Firewalls
Certificates
API Gateways
Third-party services

Understanding how to identify these situations quickly can save hours of troubleshooting and reduce unnecessary pressure on platform teams.

Start With Evidence, Not Assumptions

The first step during any incident is determining whether Kubernetes itself is actually unhealthy.

Start with:

kubectl get nodes

Healthy output:

STATUS
Ready
Ready
Ready

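Then check the workloads:

kubectl get pods -A

Look for a STATUS of:

Running
Completed

Also check the READY column. A pod can report Running while its containers are failing readiness checks (for example 0/1).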

If the nodes are Ready and the workloads are running, expand the investigation beyond the cluster immediately.

Do not spend hours troubleshooting Kubernetes simply because the application is deployed there.
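On a busy cluster it helps to print only the rows that look wrong. The sketch below is one way to do that, assuming the default column layout of kubectl get nodes and kubectl get pods -A when run with --no-headers. The function names are my own and are not part of kubectl.

```shell
# Hypothetical helpers: reduce kubectl table output (read from stdin,
# without headers) to the rows that deserve a closer look.

# Nodes: columns are NAME STATUS ROLES AGE VERSION.
# Prints every node whose STATUS is not exactly "Ready". This also
# surfaces cordoned nodes such as "Ready,SchedulingDisabled".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Pods (-A): columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
# Completed pods are skipped. Any other pod is printed when its STATUS
# is not Running, or when READY (for example 0/1) shows containers
# that are not ready.
unhealthy_pods() {
  awk '$4 == "Completed" { next }
       { split($3, r, "/") }
       $4 != "Running" || r[1] != r[2] { print $1 "/" $2, $3, $4 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output from both is a reasonable signal to stop looking at the cluster and start looking at dependencies.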

The Most Misleading Incident

A common scenario looks like this:

Customers Reporting Errors
          ↓
Pods Running
          ↓
CPU Normal
          ↓
Memory Normal
          ↓
No Recent Deployment

At this point many engineers become stuck.

Everything inside Kubernetes appears healthy.

The next question should be:

What does the application depend on?

That question often reveals the actual root cause.

Azure SQL: The Silent Failure

Databases are one of the most common causes of application incidents.

Symptoms often include:

500 Errors
Timeouts
Slow Responses
Connection Failures

Meanwhile:

Pods Healthy

Example application logs:

SqlException:
Execution Timeout Expired

Connection Timeout Expired

Many teams immediately begin investigating AKS.

The actual issue may be:

Database maintenance
Query blocking
Connection exhaustion
Network restrictions
Failover events

Before touching Kubernetes, verify database connectivity.

Test directly:

kubectl exec -it pod-name -- sh

Then, using whichever tool the container image includes:

nc -zv database-server 1433

telnet database-server 1433

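The objective is to determine whether the pod can reach the database over the network. A successful TCP connection rules out routing and firewall problems, but it does not prove that logins or queries succeed.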
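The wording of the timeout in the application log is also worth reading closely, because a connection timeout and an execution timeout usually point at different causes from the list above. The sketch below is a rough triage aid. The function name and labels are my own, and the matched phrases assume the messages Microsoft's SqlClient typically emits. Other drivers word them differently.

```shell
# Hypothetical helper: map a SqlClient timeout message to a likely area.
# Takes the log line as its first argument.
classify_sql_timeout() {
  case "$1" in
    # No free connection in the client-side pool: connection exhaustion.
    *"obtaining a connection from the pool"*)
      echo "pool-exhaustion" ;;
    # The connection or login never completed: look at DNS, firewalls,
    # Private Endpoints, or a failover in progress.
    *"Connection Timeout Expired"*)
      echo "connect" ;;
    # The connection worked but the command ran too long: look at
    # query blocking, slow queries, or database resource limits.
    *"Execution Timeout Expired"*)
      echo "query" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage:
#   classify_sql_timeout "SqlException: Execution Timeout Expired"
```

A result of query means the network path is probably fine, so the nc and telnet tests are unlikely to show anything and the database itself is the better place to look.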

Redis Is Running but Applications Are Failing

Redis failures are particularly deceptive.

Applications may appear healthy while functionality quietly degrades.

Examples:

Authentication Delays
Session Failures
Cache Misses
Performance Issues

Application logs may show:

ECONNRESET

Connection Lost

Meanwhile:

kubectl get pods

shows:

Running

This is not a Kubernetes problem.

The application is unable to communicate with Redis.

Common causes include:

Firewall restrictions
DNS failures
Connection limits
Redis maintenance
TLS misconfiguration

DNS: The Root Cause Nobody Expects

DNS issues are among the most frustrating incidents because they often affect healthy applications.

Consider:

Application
      ↓
api.internal.local

If DNS resolution fails:

Connection Failed

even though:

Application Healthy

and

AKS Healthy

Inside a pod:

nslookup api.internal.local

dig api.internal.local

can immediately expose the problem.

Common DNS-related causes include:

Incorrect private zones
Broken forwarding rules
Missing records
CoreDNS issues
Private Endpoint resolution failures

DNS should always be part of the investigation process.
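How a lookup fails says a lot about which of the causes above is likely. The following sketch sorts nslookup output into rough categories. It assumes the message wording of the BIND nslookup client, and classify_nslookup is an illustrative name rather than an existing tool.

```shell
# Illustrative helper: reads nslookup output on stdin and prints a rough
# category. Other nslookup implementations (for example BusyBox) may
# word their errors differently.
classify_nslookup() {
  out=$(cat)
  case "$out" in
    # A resolver answered and said the name does not exist:
    # suspect missing records or an incorrect private zone.
    *NXDOMAIN*)
      echo "no-such-name" ;;
    # A resolver answered but could not complete the lookup:
    # suspect broken forwarding rules or an upstream DNS failure.
    *SERVFAIL*)
      echo "resolver-failure" ;;
    # Nothing answered at all: suspect CoreDNS or the network path to it.
    *"timed out"*|*"no servers could be reached"*)
      echo "resolver-unreachable" ;;
    *)
      echo "resolved-or-unknown" ;;
  esac
}

# Usage inside a pod:
#   nslookup api.internal.local 2>&1 | classify_nslookup
```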

Private Endpoints and Private DNS

Private Endpoints introduce another layer of complexity.

Example architecture:

AKS
 ↓
Private Endpoint
 ↓
Database

The database exists.

The endpoint exists.

The application still cannot connect.

Why?

Because successful Private Endpoint connectivity requires:

Private Endpoint
      +
Private DNS
      +
Network Routing

Missing any one component can break connectivity.

Common symptoms:

Connection Timeout
Host Not Found
Name Resolution Failure

Testing from within a pod quickly reveals whether DNS or routing is involved.
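One quick signal is the address the service hostname resolves to from inside the pod. If it resolves to a public address, the private DNS zone is probably not linked or the record is missing, and traffic is heading for the public endpoint instead. A minimal sketch, assuming IPv4 and the RFC 1918 private ranges. The helper name is illustrative.

```shell
# Hypothetical helper: succeeds (exit 0) when the IPv4 address passed as
# the first argument falls in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;                            # 10/8, 192.168/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;     # 172.16/12
    *) return 1 ;;
  esac
}

# Usage inside a pod (database-server is a placeholder hostname):
#   ip=$(dig +short database-server | tail -n 1)
#   if is_private_ipv4 "$ip"; then
#     echo "resolves privately: $ip"
#   else
#     echo "resolves publicly or not at all: $ip"
#   fi
```

A private address that still times out shifts suspicion from DNS to routing or network rules.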

Firewalls and Network Security Rules

Network controls frequently create incidents that appear to be application failures.

Example:

Application
      ↓
Database

A new firewall rule is introduced.

Traffic becomes blocked.

Application logs begin showing:

Timeout
Connection Refused

The pods remain healthy.

Kubernetes remains healthy.

The network path is not.

Whenever connectivity issues occur, verify:

Source
Destination
Port
Route
Firewall

before investigating Kubernetes internals.
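The difference between a timeout and a refused connection is useful here. A timeout usually means packets are being dropped silently somewhere on the path, while a refusal usually means the destination was reached and rejected the connection. The sketch below makes that distinction explicit. It assumes the container has bash and the coreutils timeout command, which many minimal images do not, and both function names are my own.

```shell
# Hypothetical helper: translate the probe's exit status into a category.
explain_probe() {
  case "$1" in
    0)   echo "open" ;;      # TCP handshake completed
    124) echo "timeout" ;;   # no answer at all: firewall, NSG, or routing
    *)   echo "refused" ;;   # fast failure: refused, unreachable, or name not resolved
  esac
}

# probe_tcp HOST PORT [SECONDS]
# Opens a TCP connection using bash's /dev/tcp feature. timeout(1) exits
# with status 124 when it has to stop the attempt after SECONDS.
probe_tcp() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
  explain_probe "$?"
}

# Usage inside a pod:
#   probe_tcp database-server 1433
```

The categories are heuristics. Some firewalls actively reject traffic instead of dropping it, which shows up as refused.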

Key Vault Dependency Failures

Modern applications often retrieve secrets from Key Vault during startup.

If Key Vault becomes inaccessible:

Application Startup Fails

Pod status:

CrashLoopBackOff

Many engineers immediately investigate:

Container Image
Deployment YAML
Node Health

The actual issue may be:

Key Vault Access Failure

Common causes:

Expired permissions
Managed Identity issues
Workload Identity issues
Firewall restrictions
DNS resolution failures

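Before assuming Kubernetes is responsible, always read the logs of the container that crashed:

kubectl logs pod-name --previous

The --previous flag returns the output of the last terminated container, which is usually where the Key Vault error appears for a pod in CrashLoopBackOff.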

Certificate Chain Problems

Certificate issues frequently appear as networking problems.

Example:

HTTPS Endpoint

returns:

TLS Handshake Failed

Application logs:

unable to verify the first certificate

unable to get local issuer certificate

The cluster is healthy.

The application is healthy.

The certificate chain is incomplete.

Testing:

openssl s_client \
-connect endpoint:443 \
-servername endpoint

can expose these problems immediately.
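The most useful line in that output is "Verify return code". The sketch below pulls the code out and gives the common ones a label. The function names are illustrative, and the numeric codes are the standard OpenSSL verification results.

```shell
# Hypothetical helper: reads openssl s_client output on stdin and prints
# the first numeric verify return code, or nothing if none is present.
verify_code() {
  sed -n 's/.*Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: label the codes most often seen with chain problems.
describe_verify_code() {
  case "$1" in
    0)  echo "ok" ;;
    10) echo "certificate has expired" ;;
    20) echo "unable to get local issuer certificate" ;;
    21) echo "unable to verify the first certificate" ;;
    "") echo "no verification result found" ;;
    *)  echo "other verification error" ;;
  esac
}

# Usage (</dev/null makes s_client exit after the handshake):
#   code=$(openssl s_client -connect endpoint:443 -servername endpoint \
#            </dev/null 2>/dev/null | verify_code)
#   describe_verify_code "$code"
```

Codes 20 and 21 match the two log messages above and typically mean the server is not sending its intermediate certificates or the client does not trust the issuing CA.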

API Gateways and Reverse Proxies

Applications often sit behind:

API Gateways
Application Gateways
Reverse Proxies

A backend service may be healthy while traffic fails before reaching it.

Example:

Client
 ↓
Gateway
 ↓
Application

Application logs:

No Requests Received

Backend health:

Healthy

The issue may exist entirely within:

Gateway Routing
Health Probes
TLS Configuration

Investigating only the application delays resolution.

Third-Party Service Dependencies

Some incidents originate completely outside your environment.

Examples:

Payment Gateway
SMS Provider
Identity Provider
Email Provider

Application logs:

503
429
Timeout

Kubernetes cannot fix an unavailable third-party service.

Understanding external dependencies is critical during incident response.
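Calling the provider directly from inside a pod helps separate "they are down" from "we are being throttled" from "we cannot reach them". A minimal sketch, assuming curl is available in the container. The function name is my own and the URL in the usage comment is a placeholder.

```shell
# Hypothetical helper: sort an HTTP status code into a rough category.
triage_http_status() {
  case "$1" in
    000)     echo "no-response" ;;     # curl prints 000 when no HTTP response arrived
    429)     echo "rate-limited" ;;    # the provider is throttling our requests
    5??)     echo "provider-error" ;;  # the provider answered but is failing
    2??|3??) echo "ok" ;;
    *)       echo "client-error" ;;
  esac
}

# Usage inside a pod (placeholder URL):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://api.example.com/health)
#   triage_http_status "$code"
```

A result of no-response points back at DNS, firewalls, or certificates on your side of the connection. The other failure categories are usually a conversation with the provider.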

A Practical Investigation Workflow

Whenever an application outage occurs, I follow the same sequence.

Step 1

Verify cluster health.

kubectl get nodes

Step 2

Verify pod health.

kubectl get pods

Step 3

Verify ingress.

kubectl get ingress

Step 4

Review logs.

kubectl logs pod-name

Step 5

Identify dependencies.

Examples:

SQL
Redis
Key Vault
DNS
External APIs

Step 6

Test connectivity from inside the pod.

kubectl exec -it pod-name -- sh

Step 7

Verify DNS resolution.

nslookup api.internal.local

Step 8

Verify TLS connectivity.

openssl s_client -connect endpoint:443 -servername endpoint

Step 9

Validate network controls.

Firewall
NSG
Route Table
Private Endpoint

Step 10

Only investigate Kubernetes internals if evidence points there.

Common Mistakes During Incident Response

Mistake 1

Assuming Kubernetes is guilty because the application runs on Kubernetes.

Mistake 2

Ignoring application logs.

Mistake 3

Skipping connectivity tests.

Mistake 4

Never checking DNS.

Mistake 5

Ignoring certificate validation errors.

Mistake 6

Treating timeouts as Kubernetes issues.

In my experience, most timeouts originate in a dependency or the network path, not in the cluster.

The Dependency Mindset

One useful mental model is:

Application
     ↓
Dependencies

The application is only as healthy as the systems it depends on.

A healthy AKS cluster cannot compensate for:

Unreachable databases
Broken DNS
Invalid certificates
Traffic blocked by firewalls
Unavailable third-party APIs

When troubleshooting production incidents, dependencies deserve as much attention as Kubernetes itself.

Conclusion

Some of the longest outages I have investigated had nothing to do with Kubernetes.

The cluster was healthy.

The nodes were healthy.

The pods were healthy.

The actual issue was hidden behind a dependency that nobody initially considered.

Successful troubleshooting starts by eliminating assumptions.

Verify the cluster.

Verify the application.

Then systematically validate every dependency the application relies on.

Kubernetes is often the first component blamed during an outage because it is highly visible.

In many cases, it is also completely innocent.

The faster teams learn to distinguish between platform failures and dependency failures, the faster they resolve incidents and the more effective their troubleshooting process becomes.
