Beyond "It Works on My Machine": Solving Docker Networking & DNS Bottlenecks in Production
You've been there. Your staging environment is green. Your local Docker Compose setup is flawless. But the moment production traffic ramps past 50% of peak, your logs start bleeding EAI_AGAIN and ETIMEDOUT errors.
The culprit? It's rarely your code. It's the silent, often misunderstood layer of Docker Networking and DNS resolution.
In this guide, we're going deep into the production-grade networking issues that plague high-traffic applications. We'll cover why your DNS lookups are failing, how to optimize container-to-container communication, and how to fix the dreaded MTU mismatch that kills packets on AWS.
1. The DNS Resolution Trap: ndots and Search Domains
When a container tries to resolve api.internal.service, it doesn't just ask the DNS server once. Because of how Linux handles DNS, it might ask five times.
The Problem: DNS Amplification
By default, Kubernetes sets ndots:5 in each pod's /etc/resolv.conf (plain Docker containers inherit the host's resolver configuration, search domains included). With ndots:5, any hostname containing fewer than 5 dots gets every search domain in your configuration appended before the absolute name is tried.
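A container's /etc/resolv.conf might look something like this (the nameserver shown is Docker's embedded DNS; the search domain is illustrative):

```
nameserver 127.0.0.11
search my-app.local
options ndots:5
```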
If your search domain is my-app.local, a lookup for google.com looks like this:
- google.com.my-app.local (NXDOMAIN)
- google.com (SUCCESS)
In a microservices architecture with 20 services, this creates a massive, unnecessary load on your internal DNS resolver (CoreDNS or Docker's embedded DNS).
The Fix: Fully Qualified Domain Names (FQDN)
Always append a trailing dot to your internal service calls to bypass the search list.
// ❌ Bad: Triggers search domain lookups
const response = await fetch('http://auth-service/v1/user');
// ✅ Good: Absolute lookup
const response = await fetch('http://auth-service./v1/user');
2. Node.js and the DNS Caching Myth
Did you know that Node.js, by default, does not cache DNS lookups? Every single axios.get() or fetch() call triggers a new DNS request to the OS. Under high load, this can saturate the thread pool and lead to EAI_AGAIN.
The Fix: A Caching DNS Layer
To fix this, add a lookaside cache in front of the resolver, either via a custom lookup function on your http.Agent or with a library like dnscache.
const http = require('http');
const https = require('https');

// dnscache monkey-patches dns.lookup, so every subsequent native
// http/https lookup is served from an in-process cache
const dnscache = require('dnscache')({
  enable: true,
  ttl: 300,        // seconds
  cachesize: 1000
});

const agent = new http.Agent({ keepAlive: true });
Production Tip: Keep-Alive is Mandatory
DNS is expensive, but TCP handshakes are worse. Always enable keepAlive: true in your production agents to reuse existing connections.
3. The MTU Mismatch: Why Your Packets are Disappearing
If your app works for small JSON payloads but hangs indefinitely on large file uploads or heavy API responses, you likely have an MTU (Maximum Transmission Unit) mismatch.
The Scenario
- Your AWS EC2 instance has an MTU of 9001 (Jumbo Frames).
- Your Docker Bridge network defaults to 1500.
- Your Overlay network (if using Swarm) adds encapsulation overhead, dropping the effective MTU to 1450.
When a 1500-byte packet hits a 1450-byte tunnel, it gets dropped if the "Don't Fragment" bit is set.
The Fix: Aligning MTU in Docker Compose
You must explicitly set the MTU for your Docker networks to match your infrastructure.
networks:
  app-network:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: "1450"
4. Service Discovery: Internal vs. External DNS
In a production environment (especially on AWS ECS or EKS), you often mix Docker-internal service discovery with external AWS Cloud Map or Route53 Private Zones.
The "Gotcha": Docker's Embedded DNS
Docker's embedded DNS server (at 127.0.0.11) is great for local development, but it has a hardcoded 30-second TTL for external lookups. If your database fails over and Route53 updates the IP, your containers might still be hitting the dead IP for 30 seconds.
Solution: Custom DNS Options
Override the DNS settings in your docker-compose.yml or ECS Task Definition to point directly to your VPC resolver.
services:
  api:
    image: my-node-app
    dns:
      - 10.0.0.2  # AWS VPC Resolver
    dns_opt:
      - timeout:2
      - attempts:3
5. Common Pitfalls in Production
1. Using localhost in Containers
localhost inside a container refers to the container itself, not the host machine. Use the service name defined in your Compose file.
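If you genuinely need to reach the host machine from a container, Docker 20.10+ lets you map host.docker.internal via the special host-gateway value (a sketch; the service and image names are placeholders):

```yaml
services:
  api:
    image: my-node-app
    extra_hosts:
      - "host.docker.internal:host-gateway"  # requires Docker 20.10+
```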
2. IPv6 Ghosting
If your host has IPv6 enabled but your Docker network doesn't, some libraries will try to resolve AAAA records first, wait for a timeout, and only then fall back to IPv4. This adds ~1-2 seconds of latency to every new connection.
Fix: Disable IPv6 in the container if not needed.
3. Port Exhaustion (Ephemeral Ports)
If you are making thousands of outbound requests from a single container, you might run out of ephemeral ports.
Fix: Increase net.ipv4.ip_local_port_range via sysctls in your Docker config.
services:
  worker:
    image: heavy-requester
    sysctls:
      - net.ipv4.ip_local_port_range=1024 65535
Conclusion & Discussion
Docker networking isn't "magic." It's a collection of iptables rules, namespaces, and virtual interfaces. When you move to production, the default settings that make development easy become the bottlenecks that kill performance.
Key Takeaways:
- Use FQDNs (with a trailing dot) to avoid DNS search domain overhead.
- Implement DNS caching and TCP Keep-Alive in your application code.
- Match your Docker MTU to your cloud provider's network.
- Monitor your DNS resolver's latency—it's often the first thing to fail under load.
What's the weirdest networking bug you've encountered in a containerized environment? Let's discuss in the comments below!
About the Author
Ameer Hamza is a Full-Stack Engineer specializing in high-performance architectures using Laravel, Node.js, and AWS. He builds scalable SaaS solutions and writes about bridging the gap between development and production-grade infrastructure.