Falolu Olaitan

When Kubernetes Isn't the Problem: Debugging External Dependencies in AKS

Introduction

One of the most common mistakes during incident response is assuming Kubernetes is the problem simply because the application runs on Kubernetes.

A customer reports an outage.

Requests start timing out.

Applications return errors.

Transactions begin failing.

The first reaction is usually:

AKS is down.

In many cases, AKS is completely healthy.

The nodes are healthy.

The pods are healthy.

The ingress controller is healthy.

The cluster is operating exactly as designed.

The actual problem exists somewhere outside Kubernetes.

After working through enough production incidents, I have learned that many application outages blamed on Kubernetes are eventually traced back to dependencies such as:

  • Databases
  • Redis
  • DNS
  • Key Vault
  • Private Endpoints
  • Firewalls
  • Certificates
  • API Gateways
  • Third-party services

Understanding how to identify these situations quickly can save hours of troubleshooting and reduce unnecessary pressure on platform teams.


Start With Evidence, Not Assumptions

The first step during any incident is determining whether Kubernetes itself is actually unhealthy.

Start with:

kubectl get nodes

Healthy output:

STATUS
Ready
Ready
Ready

Next:

kubectl get pods -A

Look for:

Running
Completed

If every node is Ready and the pods are Running with all containers ready and no unexpected restarts, widen the investigation beyond the cluster right away.

Do not spend hours troubleshooting Kubernetes simply because the application is deployed there.
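During an incident I prefer output that shows only what is wrong. The sketch below filters the two commands above down to the exceptions. It assumes the default `kubectl get` column layout, and the function names `not_ready_nodes` and `unhealthy_pods` are made up for this example.

```shell
# Print the name and status of every node whose STATUS is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is listed too.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Print namespace, pod, and status for pods that are neither Running nor Completed.
# Expects the column layout of "kubectl get pods -A" (STATUS is the 4th column).
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Usage:
#   kubectl get nodes | not_ready_nodes
#   kubectl get pods -A | unhealthy_pods
```

Empty output from both filters means the basic cluster checks are clean. A Running pod can still have containers that are not ready, so the READY and RESTARTS columns deserve a glance as well.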


The Most Misleading Incident

A common scenario looks like this:

Customers Reporting Errors
          ↓
Pods Running
          ↓
CPU Normal
          ↓
Memory Normal
          ↓
No Recent Deployment

At this point many engineers become stuck.

Everything inside Kubernetes appears healthy.

The next question should be:

What does the application depend on?

That question often reveals the actual root cause.


Azure SQL: The Silent Failure

Databases are one of the most common causes of application incidents.

Symptoms often include:

500 Errors
Timeouts
Slow Responses
Connection Failures

Meanwhile:

Pods Healthy

Example application logs:

SqlException:
Execution Timeout Expired

or

Connection Timeout Expired

Many teams immediately begin investigating AKS.

The actual issue may be:

  • Database maintenance
  • Query blocking
  • Connection exhaustion
  • Network restrictions
  • Failover events

Before touching Kubernetes, verify database connectivity.

Test directly:

kubectl exec -it pod-name -- sh

Then:

nc -zv database-server 1433

or

telnet database-server 1433

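The objective is to determine whether the pod can open a TCP connection to the database. A successful connection rules out routing and firewall blocks on that port; it says nothing about authentication or query performance.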
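Many container images ship without `nc` or `telnet`. When that happens, bash can open the socket itself. This is a minimal sketch that assumes bash and the coreutils `timeout` command exist in the image; `tcp_probe` is a name I made up for the example.

```shell
# tcp_probe HOST PORT [TIMEOUT_SECONDS]
# Exit status 0 if a TCP connection opens, non-zero otherwise.
# Uses bash's built-in /dev/tcp redirection, so no extra tools are needed.
tcp_probe() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage (inside the pod):
#   tcp_probe database-server 1433 && echo reachable || echo unreachable
```

An exit status of 124 comes from `timeout` itself and means the attempt hung until the deadline, which usually points at dropped packets rather than a rejected connection.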


Redis Is Running but Applications Are Failing

Redis failures are particularly deceptive.

Applications may appear healthy while functionality quietly degrades.

Examples:

Authentication Delays
Session Failures
Cache Misses
Performance Issues

Application logs may show:

ECONNRESET

or

Connection Lost

Meanwhile:

kubectl get pods

shows:

Running

This is not a Kubernetes problem.

The application is unable to communicate with Redis.

Common causes include:

  • Firewall restrictions
  • DNS failures
  • Connection limits
  • Redis maintenance
  • TLS misconfiguration
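Because these errors live in the application logs and not in the pod status, it helps to count them over a time window. A rough sketch: the two patterns are simply the messages quoted above, and `count_redis_errors` is a hypothetical helper name.

```shell
# Count log lines that mention a dropped Redis connection.
# Extend the pattern list to match whatever your client library logs.
count_redis_errors() {
  grep -c -E 'ECONNRESET|Connection Lost' || true
}

# Usage:
#   kubectl logs pod-name --since=15m | count_redis_errors
```

A count that keeps climbing while the pod stays Running is a strong hint that the problem sits between the application and Redis.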

DNS: The Root Cause Nobody Expects

DNS issues are among the most frustrating incidents because they often affect healthy applications.

Consider:

Application
      ↓
api.internal.local

If DNS resolution fails:

Connection Failed

even though:

Application Healthy

and

AKS Healthy

Inside a pod:

nslookup api.internal.local

or

dig api.internal.local

can immediately expose the problem.

Common DNS-related causes include:

  • Incorrect private zones
  • Broken forwarding rules
  • Missing records
  • CoreDNS issues
  • Private Endpoint resolution failures

DNS should always be part of the investigation process.
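A failed lookup does not say which layer is broken. One way to narrow it down is to compare the failing name against a name the cluster itself serves. The sketch below assumes the default cluster domain `cluster.local`; `resolves` and `dns_triage` are names invented for this example.

```shell
# resolves NAME: succeed if NAME resolves from where this runs.
# Prefers getent when the image has it, otherwise falls back to nslookup.
resolves() {
  if command -v getent >/dev/null 2>&1; then
    getent hosts "$1" >/dev/null 2>&1
  else
    nslookup "$1" >/dev/null 2>&1
  fi
}

# dns_triage NAME: report which DNS layer looks broken for NAME.
dns_triage() {
  if resolves "$1"; then
    echo "ok: $1 resolves"
  elif resolves kubernetes.default.svc.cluster.local; then
    # In-cluster DNS answers, so suspect private zones, forwarders, or missing records.
    echo "upstream: cluster DNS works, but $1 does not resolve"
  else
    # Not even the API service name resolves, so suspect CoreDNS or pod DNS config.
    echo "cluster-dns: in-cluster names do not resolve either"
  fi
}

# Usage (inside the pod):
#   dns_triage api.internal.local
```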


Private Endpoints and Private DNS

Private Endpoints introduce another layer of complexity.

Example architecture:

AKS
 ↓
Private Endpoint
 ↓
Database

The database exists.

The endpoint exists.

The application still cannot connect.

Why?

Because successful Private Endpoint connectivity requires:

Private Endpoint
      +
Private DNS
      +
Network Routing

Missing any one component can break connectivity.

Common symptoms:

Connection Timeout
Host Not Found
Name Resolution Failure

Testing from within a pod quickly reveals whether DNS or routing is involved.
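A quick tell for a Private DNS problem is the address the name resolves to. If the private zone is missing or not linked, the service name typically still resolves, only to a public IP. The sketch below checks whether an IPv4 address falls in an RFC 1918 range; it assumes your virtual network uses those ranges, and `is_private_ip` is a hypothetical helper.

```shell
# Succeed when the IPv4 address is in an RFC 1918 private range:
# 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16.
is_private_ip() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (inside the pod):
#   ip=$(getent hosts database-server | awk '{ print $1; exit }')
#   is_private_ip "$ip" && echo "private address" || echo "public or unresolved"
```

A private address that still times out shifts suspicion from DNS to routing or network rules.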


Firewalls and Network Security Rules

Network controls frequently create incidents that appear to be application failures.

Example:

Application
      ↓
Database

A new firewall rule is introduced.

Traffic becomes blocked.

Application logs begin showing:

Timeout
Connection Refused

The pods remain healthy.

Kubernetes remains healthy.

The network path is not.

Whenever connectivity issues occur, verify:

Source
Destination
Port
Route
Firewall

before investigating Kubernetes internals.
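The two log messages above tend to point in different directions, so I find it useful to separate them early. This is a rule-of-thumb sketch, not a guarantee; the function name and the interpretations in the comments are my own.

```shell
# classify_connect_error MESSAGE
# Prints "timeout", "refused", or "unknown" for a connection error message.
classify_connect_error() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    # Something answered with a reset: wrong port, no listener, or a reject rule.
    *refused*) echo "refused" ;;
    # Nothing answered at all: packets are likely dropped by a firewall, NSG, or route.
    *timeout*|*"timed out"*|*etimedout*) echo "timeout" ;;
    *) echo "unknown" ;;
  esac
}

# Usage:
#   classify_connect_error "Connection Refused"
```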


Key Vault Dependency Failures

Modern applications often retrieve secrets from Key Vault during startup.

If Key Vault becomes inaccessible:

Application Startup Fails

Pod status:

CrashLoopBackOff

Many engineers immediately investigate:

Container Image
Deployment YAML
Node Health

The actual issue may be:

Key Vault Access Failure

Common causes:

  • Expired permissions
  • Managed Identity issues
  • Workload Identity issues
  • Firewall restrictions
  • DNS resolution failures

Always verify:

kubectl logs pod-name --previous

before assuming Kubernetes is responsible.
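For a pod in CrashLoopBackOff, the useful output usually comes from the container instance that just died, which `kubectl logs` shows when given `--previous`. The sketch below turns a pod listing into the log commands worth running. It assumes the `kubectl get pods -A` column layout, and `previous_log_commands` is a made-up name.

```shell
# Read "kubectl get pods -A" output and print one "kubectl logs --previous"
# command for each pod whose STATUS is CrashLoopBackOff.
# Multi-container pods additionally need "-c <container>".
previous_log_commands() {
  awk 'NR > 1 && $4 == "CrashLoopBackOff" {
    printf "kubectl logs -n %s %s --previous\n", $1, $2
  }'
}

# Usage:
#   kubectl get pods -A | previous_log_commands
```

If those logs show a failed secret fetch or an authorization error, the investigation belongs with Key Vault access and identity, not with the image or the node.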


Certificate Chain Problems

Certificate issues frequently appear as networking problems.

Example:

HTTPS Endpoint

returns:

TLS Handshake Failed

Application logs:

unable to verify first certificate

or

unable to get local issuer certificate

The cluster is healthy.

The application is healthy.

The certificate chain is incomplete.

Testing:

openssl s_client \
-connect endpoint:443 \
-servername endpoint

can expose these problems immediately.
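The `openssl s_client` output is long, and the line that matters is `Verify return code`. This small sketch pulls out just the numeric code; `tls_verify_code` is a name I chose for the example.

```shell
# Read "openssl s_client" output on stdin and print the first verify return code.
#   0  = ok
#   20 = unable to get local issuer certificate
#   21 = unable to verify the first certificate
tls_verify_code() {
  sed -n 's/.*Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (stdin is closed so s_client exits after the handshake):
#   openssl s_client -connect endpoint:443 -servername endpoint </dev/null 2>/dev/null | tls_verify_code
```

Codes 20 and 21 match the two log messages above and usually mean the server is not sending its intermediate certificates, or the client does not trust the issuing CA.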


API Gateways and Reverse Proxies

Applications often sit behind:

  • API Gateways
  • Application Gateways
  • Reverse Proxies

A backend service may be healthy while traffic fails before reaching it.

Example:

Client
 ↓
Gateway
 ↓
Application

Application logs:

No Requests Received

Backend health:

Healthy

The issue may exist entirely within:

Gateway Routing
Health Probes
TLS Configuration

Investigating only the application delays resolution.


Third-Party Service Dependencies

Some incidents originate completely outside your environment.

Examples:

Payment Gateway
SMS Provider
Identity Provider
Email Provider

Application logs:

503
429
Timeout

Kubernetes cannot fix an unavailable third-party service.

Understanding external dependencies is critical during incident response.
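The three log entries above call for different responses, so it is worth naming them explicitly when probing a provider from inside a pod. This is a minimal sketch: the labels are my own shorthand, and the URL in the usage comment is a placeholder, not a real endpoint.

```shell
# classify_upstream_status CODE
# Maps an HTTP status code, as printed by curl's %{http_code}, to a short label.
classify_upstream_status() {
  case "$1" in
    000) echo "no-response" ;;          # curl prints 000 when no HTTP response arrived
    429) echo "rate-limited" ;;         # the provider is throttling; back off
    5[0-9][0-9]) echo "provider-error" ;;
    2[0-9][0-9]) echo "ok" ;;
    *) echo "other" ;;
  esac
}

# Usage (inside the pod):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://provider.example/health)
#   classify_upstream_status "$code"
```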


A Practical Investigation Workflow

Whenever an application outage occurs, I follow the same sequence.

Step 1

Verify cluster health.

kubectl get nodes

Step 2

Verify pod health.

kubectl get pods

Step 3

Verify ingress.

kubectl get ingress

Step 4

Review logs.

kubectl logs pod-name

Step 5

Identify dependencies.

Examples:

SQL
Redis
Key Vault
DNS
External APIs

Step 6

Test connectivity from inside the pod.

kubectl exec -it pod-name -- sh

Step 7

Verify DNS resolution.

nslookup api.internal.local

Step 8

Verify TLS connectivity.

openssl s_client -connect endpoint:443 -servername endpoint

Step 9

Validate network controls.

Firewall
NSG
Route Table
Private Endpoint

Step 10

Only investigate Kubernetes internals if evidence points there.
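Part of this sequence can be scripted so that it stops at the first failing layer. The sketch below covers the cluster, pod, DNS, and TCP steps for a single dependency. It assumes `nslookup` and `nc` exist in the container image; `run_step` and `triage` are names invented for this example.

```shell
# run_step LABEL COMMAND [ARGS...]
# Runs the command quietly and reports whether it succeeded.
run_step() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    return 1
  fi
}

# triage POD HOST PORT
# Walks outward from the cluster to one dependency and stops at the first failure.
# POD, HOST, and PORT are placeholders you supply.
triage() {
  run_step "nodes ready"  kubectl wait --for=condition=Ready nodes --all --timeout=10s &&
  run_step "pod ready"    kubectl wait --for=condition=Ready "pod/$1" --timeout=10s &&
  run_step "dns resolves" kubectl exec "$1" -- nslookup "$2" &&
  run_step "tcp connects" kubectl exec "$1" -- nc -z -w 5 "$2" "$3"
}

# Usage:
#   triage pod-name database-server 1433
```

The first FAIL line names the layer to investigate. If the cluster and pod steps pass and a later step fails, the evidence points away from Kubernetes.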


Common Mistakes During Incident Response

Mistake 1

Assuming Kubernetes is guilty because the application runs on Kubernetes.


Mistake 2

Ignoring application logs.


Mistake 3

Skipping connectivity tests.


Mistake 4

Never checking DNS.


Mistake 5

Ignoring certificate validation errors.


Mistake 6

Treating timeouts as Kubernetes issues.

In my experience, most timeouts originate in a dependency or the network path, not in the cluster.


The Dependency Mindset

One useful mental model is:

Application
     ↓
Dependencies

The application is only as healthy as the systems it depends on.

A healthy AKS cluster cannot compensate for:

  • Unreachable databases
  • Broken DNS
  • Invalid certificates
  • Traffic blocked by firewalls
  • Unavailable third-party APIs

When troubleshooting production incidents, dependencies deserve as much attention as Kubernetes itself.


Conclusion

Some of the longest outages I have investigated had nothing to do with Kubernetes.

The cluster was healthy.

The nodes were healthy.

The pods were healthy.

The actual issue was hidden behind a dependency that nobody initially considered.

Successful troubleshooting starts by eliminating assumptions.

Verify the cluster.

Verify the application.

Then systematically validate every dependency the application relies on.

Kubernetes is often the first component blamed during an outage because it is highly visible.

In many cases, it is also completely innocent.

The faster teams learn to distinguish platform failures from dependency failures, the faster they resolve incidents.
