A production Kubernetes application started showing latency issues during peak hours. User reports flagged slow page loads and inconsistent response times.
The infrastructure team's initial reaction was to add more nodes to the cluster. But throwing compute at a latency issue is inefficient and costly, so a deeper inspection was performed before provisioning additional resources.
Root Causes Identified:
- Too many service hops
- CoreDNS misconfigurations
- No caching for repeated API calls
Real Solutions (Not More Nodes)
1. Use a Service Mesh
Why:
Service meshes like Istio or Linkerd reduce latency by enabling intelligent routing, retries, timeouts, and circuit breaking — optimizing pod-to-pod communication.
Commands (Istio example):
# Install Istio
istioctl install --set profile=demo -y
# Enable automatic sidecar injection
kubectl label namespace default istio-injection=enabled
# Deploy your app with mesh support
kubectl apply -f your-app-deployment.yaml
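With the sidecars in place, retries, timeouts, and circuit breaking are configured through Istio's VirtualService and DestinationRule resources. Below is a minimal sketch for a hypothetical pricing-service workload; the names and thresholds are illustrative, not taken from the original setup.

```yaml
# Hypothetical example: retries, a request timeout, and basic circuit
# breaking for a service named "pricing-service" (adjust to your app).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: pricing-service
spec:
  hosts:
    - pricing-service
  http:
    - route:
        - destination:
            host: pricing-service
      timeout: 5s          # fail fast instead of hanging on slow pods
      retries:
        attempts: 3
        perTryTimeout: 2s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: pricing-service
spec:
  host: pricing-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
    outlierDetection:      # eject consistently failing pods (circuit breaking)
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

Tight per-try timeouts plus outlier detection keep a single slow pod from dragging down overall response times.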
2. Fix CoreDNS Configuration
Why:
Misconfigured CoreDNS leads to excessive lookups, especially if the upstream/loop plugins are misused or timeouts are too high.
Steps:
- Inspect CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns
- Edit CoreDNS ConfigMap:
kubectl edit configmap coredns -n kube-system
Optimizations:
- Set appropriate TTLs:
cache 30
- Minimize forward retries:
forward . /etc/resolv.conf {
max_concurrent 1000
}
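Putting those two optimizations together, the relevant part of the coredns ConfigMap ends up looking roughly like this (a trimmed sketch; your cluster's default Corefile will include a few more plugins such as prometheus and loadbalance):

```yaml
# Sketch of the kube-system/coredns ConfigMap, trimmed to the plugins discussed above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        # cache responses for 30 seconds instead of re-resolving every call
        cache 30
        # cap concurrent upstream queries instead of letting them pile up
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        loop
        reload
    }
```

With the reload plugin enabled, CoreDNS picks up ConfigMap changes on its own after a short delay, so no restart is required.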
3. Add Caching for Repeated API Calls
Why:
If microservices make repeated calls to the same APIs (e.g., auth, config, pricing), caching avoids redundant processing and DNS lookups.
Options:
- In-app memory cache (LRU, Redis)
- Sidecar caching with tools like Varnish or NGINX
Example using Redis:
# Python Flask example (get_price_from_db() is a placeholder for the real lookup)
import redis
from flask import Flask

app = Flask(__name__)
cache = redis.StrictRedis(host='redis', port=6379, db=0, decode_responses=True)

@app.route("/get-price")
def get_price():
    price = cache.get("product_price")  # serve from cache when present
    if price:
        return price
    price = str(get_price_from_db())  # cache miss: query the backend
    cache.set("product_price", price, ex=300)  # cache for 5 minutes
    return price
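The snippet above assumes a Redis instance reachable at the in-cluster DNS name redis. A minimal, non-production sketch of that dependency (single replica, no persistence or auth) could look like:

```yaml
# Minimal in-cluster Redis for the Flask example above: one replica plus
# a Service named "redis" so the hostname used by the app resolves.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```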
Why Not Add Nodes?
- The slowdown here stems from network and DNS latency, not resource exhaustion.
- Adding nodes increases cost without resolving the actual bottlenecks.
- Smart tuning of networking and caching delivers better results with less overhead.
Why Only These Solutions?
These three changes gave maximum impact with minimal cost:
| Issue | Solution | Reason Chosen |
| ------------------------- | -------------- | --------------------------------------- |
| Excessive pod-to-pod hops | Service Mesh | Centralized control + efficient routing |
| DNS resolution delays | CoreDNS tuning | Reduced lookup overhead |
| Repeated API calls | API Caching | Faster responses + reduced backend load |
Are There Better Alternatives?
Other options exist, such as:
- Upgrading to Cilium for eBPF-based networking.
- Using headless Services to bypass the kube-proxy virtual IP (sketched below).
- Tuning kube-proxy to reduce iptables hops.
However, those are deeper infra-level changes. For most real-world apps, the mesh + DNS fix + caching strategy solves 80% of latency complaints without scaling costs.
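For reference, a headless Service is just a regular Service with clusterIP: None, which makes DNS return the pod IPs directly instead of a kube-proxy-managed virtual IP. A minimal sketch with a hypothetical app label:

```yaml
# Hypothetical headless Service: clusterIP: None means DNS resolves to the
# pod IPs directly, bypassing the kube-proxy virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: pricing-service-headless
spec:
  clusterIP: None
  selector:
    app: pricing-service
  ports:
    - port: 8080
```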
Always Measure Before Scaling
Before scaling compute nodes, check usage metrics:
kubectl top pods --all-namespaces
Final Takeaway:
Before scaling your Kubernetes cluster, optimize what you already have:
- Service mesh for communication efficiency
- CoreDNS tuning to reduce DNS latency
- Caching to eliminate repetitive calls
These are network-aware, cost-effective, and production-ready solutions that bring measurable performance improvements.