Introduction
One of the most common mistakes during incident response is assuming Kubernetes is the problem simply because the application runs on Kubernetes.
A customer reports an outage.
Requests start timing out.
Applications return errors.
Transactions begin failing.
The first reaction is usually: "AKS is down."
In many cases, AKS is completely healthy.
The nodes are healthy.
The pods are healthy.
The ingress controller is healthy.
The cluster is operating exactly as designed.
The actual problem exists somewhere outside Kubernetes.
After working through enough production incidents, I have learned that many application outages blamed on Kubernetes are eventually traced back to dependencies such as:
- Databases
- Redis
- DNS
- Key Vault
- Private Endpoints
- Firewalls
- Certificates
- API Gateways
- Third-party services
Understanding how to identify these situations quickly can save hours of troubleshooting and reduce unnecessary pressure on platform teams.
Start With Evidence, Not Assumptions
The first step during any incident is determining whether Kubernetes itself is actually unhealthy.
Start with:
kubectl get nodes
Healthy output:
STATUS
Ready
Ready
Ready
Next:
kubectl get pods -A
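Look for a STATUS of:
Running
Completed
Also check that the READY column shows every container ready (for example 1/1). A pod can be Running and still fail its readiness checks. States such as Pending, CrashLoopBackOff, or ImagePullBackOff deserve a closer look.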
If the cluster is healthy and workloads are running, widen the investigation immediately to everything the application depends on.
Do not spend hours troubleshooting Kubernetes simply because the application is deployed there.
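On a larger cluster, scanning the table by eye gets slow. The sketch below is one way to script the same check, assuming the input is the parsed output of `kubectl get nodes -o json`. The function name `not_ready_nodes` and the sample node names are illustrative only.

```python
def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    unhealthy = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as healthy only if a condition of type "Ready"
        # reports status "True"; "False" and "Unknown" are both unhealthy.
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            unhealthy.append(node["metadata"]["name"])
    return unhealthy


# Hypothetical sample, shaped like `kubectl get nodes -o json` output.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}
print(not_ready_nodes(sample_nodes))
```

An empty result supports the conclusion that the nodes are not the problem, which is the signal to move on to dependencies.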
The Most Misleading Incident
A common scenario looks like this:
Customers Reporting Errors
↓
Pods Running
↓
CPU Normal
↓
Memory Normal
↓
No Recent Deployment
At this point many engineers become stuck.
Everything inside Kubernetes appears healthy.
The next question should be:
What does the application depend on?
That question often reveals the actual root cause.
Azure SQL: The Silent Failure
Databases are one of the most common causes of application incidents.
Symptoms often include:
500 Errors
Timeouts
Slow Responses
Connection Failures
Meanwhile:
Pods Healthy
Example application logs:
SqlException:
Execution Timeout Expired
or
Connection Timeout Expired
Many teams immediately begin investigating AKS.
The actual issue may be:
- Database maintenance
- Query blocking
- Connection exhaustion
- Network restrictions
- Failover events
Before touching Kubernetes, verify database connectivity.
Test directly:
kubectl exec -it pod-name -- sh
Then:
nc -zv database-server 1433
or
telnet database-server 1433
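The objective is to determine whether the pod can reach the database.
Minimal container images often ship without nc or telnet. In that case, an ephemeral debug container started with kubectl debug can supply them.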
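If Python happens to be available in the container or in a debug container, the same reachability test can be written with the standard library. This is a minimal sketch, not a replacement for nc. The function name `probe_tcp` and the result labels are my own. It separates a refused connection from a timeout, and the two usually point at different causes.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all: often a dropped packet or a missing route.
        return "timeout"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "dns_failure"
    except OSError as exc:
        return f"error: {exc}"
```

For the database example above, the call would be `probe_tcp("database-server", 1433)`. A result of `open` only proves the network path; authentication and query problems still need the application logs.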
Redis Is Running but Applications Are Failing
Redis failures are particularly deceptive.
Applications may appear healthy while functionality quietly degrades.
Examples:
Authentication Delays
Session Failures
Cache Misses
Performance Issues
Application logs may show:
ECONNRESET
or
Connection Lost
Meanwhile:
kubectl get pods
shows:
Running
This is not a Kubernetes problem.
The application is unable to communicate with Redis.
Common causes include:
- Firewall restrictions
- DNS failures
- Connection limits
- Redis maintenance
- TLS misconfiguration
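A quick way to separate "Redis is unreachable" from "Redis is reachable but rejecting us" is to send a PING and read the reply. The sketch below assumes a plain TCP endpoint on the default port 6379. A TLS-only endpoint would need the socket wrapped with the `ssl` module first. The function names and result strings are illustrative.

```python
import socket


def interpret_redis_reply(reply: bytes) -> str:
    """Translate the first bytes of a Redis reply into a short diagnosis."""
    if reply.startswith(b"+PONG"):
        return "ok"
    if reply.startswith(b"-NOAUTH"):
        return "reachable, authentication required"
    if reply.startswith(b"-"):
        detail = reply[1:].decode(errors="replace").strip()
        return "reachable, server error: " + detail
    if not reply:
        return "connection closed without a reply"
    return "unexpected reply"


def redis_ping(host: str, port: int = 6379, timeout: float = 5.0) -> str:
    """Send an inline PING over plain TCP and interpret the reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return interpret_redis_reply(conn.recv(128))
    except OSError as exc:
        return f"unreachable: {exc}"
```

An `unreachable` result points at the network, DNS, or firewall causes listed above. A reply that starts with an error points at Redis itself, for example connection limits or authentication.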
DNS: The Root Cause Nobody Expects
DNS issues are among the most frustrating incidents because they often affect healthy applications.
Consider:
Application
↓
api.internal.local
If DNS resolution fails:
Connection Failed
even though:
Application Healthy
and
AKS Healthy
Inside a pod:
nslookup api.internal.local
or
dig api.internal.local
can immediately expose the problem.
Common DNS-related causes include:
- Incorrect private zones
- Broken forwarding rules
- Missing records
- CoreDNS issues
- Private Endpoint resolution failures
DNS should always be part of the investigation process.
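When nslookup and dig are not installed, resolution can also be checked through the system resolver, which is the same path the application uses. This is a small sketch; the function name `resolve` is my own choice.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated addresses for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed: missing record, broken forwarder, or no resolver.
        return []
    # Each entry ends with a sockaddr tuple whose first element is the address.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("api.internal.local")` from inside the pod and getting an empty list confirms that the failure happens before any connection is attempted.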
Private Endpoints and Private DNS
Private Endpoints introduce another layer of complexity.
Example architecture:
AKS
↓
Private Endpoint
↓
Database
The database exists.
The endpoint exists.
The application still cannot connect.
Why?
Because successful Private Endpoint connectivity requires:
Private Endpoint
+
Private DNS
+
Network Routing
Missing any one component can break connectivity.
Common symptoms:
Connection Timeout
Host Not Found
Name Resolution Failure
Testing from within a pod quickly reveals whether DNS or routing is involved.
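One check I find useful here is looking at what the name resolves to, not only whether it resolves. With a Private Endpoint in place, the service hostname is normally expected to resolve to a private address from inside the virtual network. A public address suggests the private DNS zone is not being consulted. The sketch below assumes the addresses have already been looked up; the function name and the sample addresses are illustrative.

```python
import ipaddress


def classify_resolution(addresses: list[str]) -> str:
    """Return "unresolved", "private", or "public" for a list of resolved IPs.

    "public" is returned if any address falls outside private ranges.
    """
    if not addresses:
        return "unresolved"
    if all(ipaddress.ip_address(a).is_private for a in addresses):
        return "private"
    return "public"


# Hypothetical lookups: 10.1.2.4 stands in for a Private Endpoint address,
# 8.8.8.8 stands in for any public address.
print(classify_resolution(["10.1.2.4"]))
print(classify_resolution(["8.8.8.8"]))
```

`unresolved` points at DNS, `public` points at the private DNS zone or its virtual network link, and `private` with a connection timeout points at routing or network rules.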
Firewalls and Network Security Rules
Network controls frequently create incidents that appear to be application failures.
Example:
Application
↓
Database
A new firewall rule is introduced.
Traffic becomes blocked.
Application logs begin showing:
Timeout
Connection Refused
The pods remain healthy.
Kubernetes remains healthy.
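The network path is not.
The two errors point in different directions. A firewall that silently drops packets typically surfaces as a timeout. A refused connection usually means the request was actively rejected or nothing is listening on that port.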
Whenever connectivity issues occur, verify:
Source
Destination
Port
Route
Firewall
before investigating Kubernetes internals.
Key Vault Dependency Failures
Modern applications often retrieve secrets from Key Vault during startup.
If Key Vault becomes inaccessible:
Application Startup Fails
Pod status:
CrashLoopBackOff
Many engineers immediately investigate:
Container Image
Deployment YAML
Node Health
The actual issue may be:
Key Vault Access Failure
Common causes:
- Expired credentials or revoked permissions
- Managed Identity issues
- Workload Identity issues
- Firewall restrictions
- DNS resolution failures
Always check the container logs, including those from the previous crashed attempt:
kubectl logs pod-name --previous
before assuming Kubernetes is responsible.
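When several pods are restarting at once, it helps to know exactly which containers are crash-looping before pulling logs. The sketch below assumes the parsed output of `kubectl get pods -o json`. The function name and the sample pod names are hypothetical.

```python
def crashlooping_containers(pod_list: dict) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart count) for containers in CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # A crash-looping container sits in the "waiting" state
            # with the reason set to CrashLoopBackOff.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(
                    (pod["metadata"]["name"], cs["name"], cs.get("restartCount", 0))
                )
    return found


# Hypothetical sample, shaped like `kubectl get pods -o json` output.
sample_pods = {
    "items": [
        {"metadata": {"name": "orders-api"},
         "status": {"containerStatuses": [
             {"name": "app", "restartCount": 7,
              "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"name": "web"},
         "status": {"containerStatuses": [
             {"name": "app", "restartCount": 0, "state": {"running": {}}}]}},
    ]
}
print(crashlooping_containers(sample_pods))
```

Each result names the pod and container whose logs are worth reading first. A Key Vault access failure normally shows up there as an authorization or connection error, not as a Kubernetes event.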
Certificate Chain Problems
Certificate issues frequently appear as networking problems.
Example:
HTTPS Endpoint
returns:
TLS Handshake Failed
Application logs:
unable to verify first certificate
or
unable to get local issuer certificate
The cluster is healthy.
The application is healthy.
The certificate chain is incomplete.
Testing:
openssl s_client \
-connect endpoint:443 \
-servername endpoint
can expose these problems immediately.
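The same handshake can be attempted from code, which is closer to what the application itself experiences. This is a minimal sketch using Python's default trust store. The function name `check_tls` and the result strings are my own, and the trust store inside your container may differ from the one your application runtime uses.

```python
import socket
import ssl


def check_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and describe the outcome."""
    # The default context verifies both the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Typical for an incomplete chain, an expired certificate,
        # or a hostname mismatch.
        return f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return f"tls error: {exc}"
    except OSError as exc:
        return f"unreachable: {exc}"
```

A verification failure here, combined with a successful plain TCP connection to the same port, is strong evidence that the problem is the certificate and not the network.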
API Gateways and Reverse Proxies
Applications often sit behind:
- API Gateways
- Application Gateways
- Reverse Proxies
A backend service may be healthy while traffic fails before reaching it.
Example:
Client
↓
Gateway
↓
Application
Application logs:
No Requests Received
Backend health:
Healthy
The issue may exist entirely within:
Gateway Routing
Health Probes
TLS Configuration
Investigating only the application delays resolution.
Third-Party Service Dependencies
Some incidents originate completely outside your environment.
Examples:
Payment Gateway
SMS Provider
Identity Provider
Email Provider
Application logs:
503
429
Timeout
Kubernetes cannot fix an unavailable third-party service.
Understanding external dependencies is critical during incident response.
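During triage it helps to translate what the application logged into a statement about the provider. The mapping below is a rough sketch of how I read those codes, not an official classification. The function name and the wording of each result are assumptions.

```python
from typing import Optional


def interpret_upstream(status: Optional[int]) -> str:
    """Give a first-pass reading of a third-party response status.

    Pass None when the call timed out or never got a response.
    """
    if status is None:
        return "no response: timeout or network failure"
    if status == 429:
        return "rate limited by the provider"
    if status == 503:
        return "provider unavailable or overloaded"
    if 500 <= status <= 599:
        return "provider-side error"
    if 400 <= status <= 499:
        return "request rejected: check credentials and payload"
    return "no provider fault indicated"
```

None of these outcomes can be fixed by restarting pods, which is the point of checking them before touching the cluster.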
A Practical Investigation Workflow
Whenever an application outage occurs, I follow the same sequence.
Step 1
Verify cluster health.
kubectl get nodes
Step 2
Verify pod health.
kubectl get pods -A
Step 3
Verify ingress.
kubectl get ingress -A
Step 4
Review logs.
kubectl logs pod-name
Step 5
Identify dependencies.
Examples:
SQL
Redis
Key Vault
DNS
External APIs
Step 6
Test connectivity from inside the pod.
kubectl exec -it pod-name -- sh
Step 7
Verify DNS resolution.
nslookup api.internal.local
Step 8
Verify TLS connectivity.
openssl s_client -connect endpoint:443 -servername endpoint
Step 9
Validate network controls.
Firewall
NSG
Route Table
Private Endpoint
Step 10
Only investigate Kubernetes internals if evidence points there.
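The value of the sequence is its order: stop at the first check that fails and investigate there. A small runner makes that explicit. In the sketch below the checks are stand-in lambdas with hard-coded results. In practice each one would wrap a real probe, and all the names are illustrative.

```python
from typing import Callable, Optional

# A check is a display name plus a function that returns True when healthy.
Check = tuple[str, Callable[[], bool]]


def first_failure(checks: list[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return None


# Stand-in checks with fixed results, for illustration only.
checks = [
    ("cluster health", lambda: True),
    ("pod health", lambda: True),
    ("database connectivity", lambda: False),
    ("dns resolution", lambda: True),
]
print(first_failure(checks))
```

Returning only the first failure keeps attention on one layer at a time, which matches how the steps above are meant to be followed.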
Common Mistakes During Incident Response
Mistake 1
Assuming Kubernetes is guilty because the application runs on Kubernetes.
Mistake 2
Ignoring application logs.
Mistake 3
Skipping connectivity tests.
Mistake 4
Never checking DNS.
Mistake 5
Ignoring certificate validation errors.
Mistake 6
Treating timeouts as Kubernetes issues.
In my experience, most timeouts originate in a dependency or in the network path to it.
The Dependency Mindset
One useful mental model is:
Application
↓
Dependencies
The application is only as healthy as the systems it depends on.
A healthy AKS cluster cannot compensate for:
- Unreachable databases
- Broken DNS
- Invalid certificates
- Traffic blocked by firewalls
- Unavailable third-party APIs
When troubleshooting production incidents, dependencies deserve as much attention as Kubernetes itself.
Conclusion
Some of the longest outages I have investigated had nothing to do with Kubernetes.
The cluster was healthy.
The nodes were healthy.
The pods were healthy.
The actual issue was hidden behind a dependency that nobody initially considered.
Successful troubleshooting starts by eliminating assumptions.
Verify the cluster.
Verify the application.
Then systematically validate every dependency the application relies on.
Kubernetes is often the first component blamed during an outage because it is highly visible.
In many cases, it is also completely innocent.
The faster teams learn to distinguish between platform failures and dependency failures, the faster they resolve incidents and the more effective their troubleshooting process becomes.