A production Kubernetes application started showing latency issues during peak hours. User reports flagged slow page loads and inconsistent response times.
The infrastructure team's initial reaction was to add more nodes to the cluster. But throwing compute at a latency issue is inefficient and costly, so a deeper inspection was performed before provisioning additional resources.
Root Causes Identified:
- Too many service hops
- CoreDNS misconfigurations
- No caching for repeated API calls
Real Solutions (Not More Nodes)
1. Use a Service Mesh
Why:
Service meshes like Istio or Linkerd reduce latency by enabling intelligent routing, retries, timeouts, and circuit breaking — optimizing pod-to-pod communication.
Commands (Istio example):
# Install Istio
istioctl install --set profile=demo -y
# Enable automatic sidecar injection
kubectl label namespace default istio-injection=enabled
# Deploy your app with mesh support
kubectl apply -f your-app-deployment.yaml
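With the sidecars in place, retries, timeouts, and circuit breaking are configured through Istio's VirtualService and DestinationRule resources. Below is a minimal sketch for a hypothetical pricing-service workload; the names and thresholds are illustrative, not taken from the original setup.

```yaml
# Hypothetical example: retries, a request timeout, and basic circuit
# breaking for a service named "pricing-service" (adjust to your app).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: pricing-service
spec:
  hosts:
    - pricing-service
  http:
    - route:
        - destination:
            host: pricing-service
      timeout: 5s          # fail fast instead of hanging on slow pods
      retries:
        attempts: 3
        perTryTimeout: 2s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: pricing-service
spec:
  host: pricing-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
    outlierDetection:      # eject consistently failing pods (circuit breaking)
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

Tight per-try timeouts plus outlier detection keep a single slow pod from dragging down overall response times.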
2. Fix CoreDNS Configuration
Why:
Misconfigured CoreDNS leads to excessive lookups, especially if the upstream/loop plugins are misused or timeouts are too high.
Steps:
- Inspect CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns
- Edit CoreDNS ConfigMap:
kubectl edit configmap coredns -n kube-system
Optimizations:
- Set appropriate TTLs:
cache 30
- Minimize forward retries:
forward . /etc/resolv.conf {
max_concurrent 1000
}
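Putting those two optimizations together, the relevant part of the coredns ConfigMap ends up looking roughly like this (a trimmed sketch; your cluster's default Corefile will include a few more plugins such as prometheus and loadbalance):

```yaml
# Sketch of the kube-system/coredns ConfigMap, trimmed to the plugins discussed above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        # cache responses for 30 seconds instead of re-resolving every call
        cache 30
        # cap concurrent upstream queries instead of letting them pile up
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        loop
        reload
    }
```

With the reload plugin enabled, CoreDNS picks up ConfigMap changes on its own after a short delay, so no restart is required.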
3. Add Caching for Repeated API Calls
Why:
If microservices make repeated calls to the same APIs (e.g., auth, config, pricing), caching avoids redundant processing and DNS lookups.
Options:
- In-app memory cache (LRU, Redis)
- Sidecar caching with tools like Varnish or NGINX
Example using Redis:
# Python Flask example (get_price_from_db() is a placeholder for the real lookup)
import redis
from flask import Flask

app = Flask(__name__)
cache = redis.StrictRedis(host='redis', port=6379, db=0, decode_responses=True)

@app.route("/get-price")
def get_price():
    price = cache.get("product_price")  # serve from cache when present
    if price:
        return price
    price = str(get_price_from_db())  # cache miss: query the backend
    cache.set("product_price", price, ex=300)  # cache for 5 minutes
    return price
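The snippet above assumes a Redis instance reachable at the in-cluster DNS name redis. A minimal, non-production sketch of that dependency (single replica, no persistence or auth) could look like:

```yaml
# Minimal in-cluster Redis for the Flask example above: one replica plus
# a Service named "redis" so the hostname used by the app resolves.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```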
Why Not Add Nodes?
- The slowdown here stems from network and DNS latency, not resource exhaustion.
- Adding nodes increases cost without resolving the actual bottlenecks.
- Smart tuning of networking and caching delivers better results with less overhead.
Why Only These Solutions?
These three changes gave maximum impact with minimal cost:
| Issue | Solution | Reason Chosen |
| ------------------------- | -------------- | --------------------------------------- |
| Excessive pod-to-pod hops | Service Mesh | Centralized control + efficient routing |
| DNS resolution delays | CoreDNS tuning | Reduced lookup overhead |
| Repeated API calls | API Caching | Faster responses + reduced backend load |
Are There Better Alternatives?
Other options exist, such as:
- Upgrading to Cilium for eBPF-based networking.
- Using headless Services to bypass the kube-proxy virtual IP (sketched below).
- Tuning kube-proxy to reduce iptables hops.
However, those are deeper infra-level changes. For most real-world apps, the mesh + DNS fix + caching strategy solves 80% of latency complaints without scaling costs.
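For reference, a headless Service is just a regular Service with clusterIP: None, which makes DNS return the pod IPs directly instead of a kube-proxy-managed virtual IP. A minimal sketch with a hypothetical app label:

```yaml
# Hypothetical headless Service: clusterIP: None means DNS resolves to the
# pod IPs directly, bypassing the kube-proxy virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: pricing-service-headless
spec:
  clusterIP: None
  selector:
    app: pricing-service
  ports:
    - port: 8080
```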
Always Measure Before Scaling
Before scaling compute nodes, check usage metrics:
kubectl top pods --all-namespaces
Final Takeaway:
Before scaling your Kubernetes cluster, optimize what you already have:
- Service mesh for communication efficiency
- CoreDNS tuning to reduce DNS latency
- Caching to eliminate repetitive calls
These are network-aware, cost-effective, and production-ready solutions that bring measurable performance improvements.