Async DNS Resolution with aiodns for High-Throughput Video Crawlers

#python #asyncio #performance #webdev

When you're crawling tens of thousands of video pages per hour to keep a discovery index fresh, every millisecond of latency compounds. At DailyWatch we run a fleet of asynchronous Python crawlers that fan out across regional CDNs, embed providers, and source channels. For a long time we assumed our bottleneck was bandwidth or HTTP keep-alive. It wasn't. It was DNS.

This post walks through how we replaced Python's default name resolution with aiodns and cut p99 fetch latency by roughly two-thirds (3.2s to 1.1s) on cold-cache requests.

The hidden cost of getaddrinfo

asyncio is wonderful right up until it isn't. The default event loop resolves hostnames via loop.getaddrinfo(), which under the hood dispatches socket.getaddrinfo to a thread pool executor. That has three problems for a high-fan-out crawler:

Thread pool starvation. On Python 3.8+ the default executor has at most min(32, os.cpu_count() + 4) worker threads. Burst 500 DNS lookups in parallel and most of them sit queued waiting for a free thread.
GIL pressure. Even though getaddrinfo releases the GIL, the surrounding Python plumbing does not. Resolver-heavy workloads serialize hard.
No control over caching or retries. glibc's NSS layer caches almost nothing in-process, and nscd is rarely deployed on container hosts.
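
To make the first point concrete, here is a minimal simulation of the queueing effect. It is a sketch, not our crawler code: fake_blocking_lookup is a hypothetical stand-in that sleeps instead of calling socket.getaddrinfo, and the worker count is deliberately small so the effect shows up quickly.

```python
import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor


def default_executor_workers() -> int:
    # Mirrors the min(32, os.cpu_count() + 4) formula quoted above.
    return min(32, (os.cpu_count() or 1) + 4)


def fake_blocking_lookup(delay: float) -> str:
    # Hypothetical stand-in for socket.getaddrinfo: it just blocks the thread.
    time.sleep(delay)
    return "192.0.2.1"


async def burst(lookups: int, workers: int, delay: float) -> float:
    """Run `lookups` blocking calls on a pool of `workers` threads; return seconds elapsed."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        await asyncio.gather(
            *(loop.run_in_executor(pool, fake_blocking_lookup, delay) for _ in range(lookups))
        )
        return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = asyncio.run(burst(lookups=40, workers=4, delay=0.05))
    print(f"default pool size here: {default_executor_workers()}")
    print(f"40 lookups on 4 threads took {elapsed:.2f}s")
```

With 40 lookups and 4 threads the calls run in ten sequential waves, so total time scales with lookups divided by workers rather than with the latency of a single lookup.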

We noticed this when our crawler's CPU sat at 12% while hundreds of tasks pended on what looked like network I/O. py-spy told the real story: the top frame was _socket.getaddrinfo.

Enter aiodns

aiodns is a thin async wrapper around pycares, the Python binding for c-ares, a battle-tested C library used by curl, Node.js, and Wireshark. It speaks DNS directly via UDP/TCP without going through the libc resolver, which means:

True non-blocking resolution integrated with the event loop
Configurable nameservers, timeouts, and retries
No thread pool fanout for resolution

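Install it with pip install aiodns. This pulls in pycares, whose wheels bundle their own copy of c-ares, so a system c-ares package is normally only needed if you build pycares against the system library.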

import asyncio
import aiodns

async def resolve(host: str) -> list[str]:
    resolver = aiodns.DNSResolver(
        nameservers=["1.1.1.1", "8.8.8.8"],
        timeout=2.0,
        tries=2,
    )
    result = await resolver.query(host, "A")
    return [r.host for r in result]

print(asyncio.run(resolve("example.com")))

Building a resolver pool with in-process caching

For a crawler that hits the same CDN hostnames thousands of times per minute, we cache resolution results in-process and honor each record's TTL. c-ares returns the TTL with every answer, and caching in Python lets hot paths skip the network round trip entirely.

import time
import asyncio
import aiodns

class CachedResolver:
    def __init__(self, ttl_floor: int = 30):
        self._resolver = aiodns.DNSResolver(
            nameservers=["1.1.1.1", "8.8.8.8"],
            timeout=2.0,
            tries=2,
        )
        self._cache: dict[str, tuple[float, list[str]]] = {}
        self._ttl_floor = ttl_floor
        self._locks: dict[str, asyncio.Lock] = {}

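    async def resolve(self, host: str) -> list[str]:
        # Fast path: serve from cache while the entry is still fresh.
        cached = self._cache.get(host)
        if cached and cached[0] > time.monotonic():
            return cached[1]

        # One lock per host so concurrent cold misses share a single query.
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            # Re-check with a fresh clock: another task may have filled the cache.
            cached = self._cache.get(host)
            if cached and cached[0] > time.monotonic():
                return cached[1]

            records = await self._resolver.query(host, "A")
            ips = [r.host for r in records]
            # The answer set is only valid until its shortest TTL expires.
            ttl = min((r.ttl for r in records), default=self._ttl_floor)
            expires = time.monotonic() + max(ttl, self._ttl_floor)
            self._cache[host] = (expires, ips)
            return ips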

The _locks dictionary is critical: without it, a thundering herd of coroutines triggers redundant lookups for the same host on cold misses. The per-host lock guarantees one in-flight query per name.
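
The effect of the per-host lock is easy to check in isolation. The sketch below strips the pattern down to its core and swaps the real resolver for a hypothetical fake_lookup coroutine that counts how often it is called; SingleFlight and demo are illustrative names, not part of aiodns.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Collapse concurrent lookups for the same host into one call."""

    def __init__(self, lookup: Callable[[str], Awaitable[list[str]]]):
        self._lookup = lookup
        self._results: dict[str, list[str]] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def get(self, host: str) -> list[str]:
        if host in self._results:
            return self._results[host]
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            # Another task may have stored the result while we waited for the lock.
            if host in self._results:
                return self._results[host]
            self._results[host] = await self._lookup(host)
            return self._results[host]


async def demo(waiters: int = 100) -> int:
    calls = 0

    async def fake_lookup(host: str) -> list[str]:
        # Hypothetical stand-in for a real DNS query.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return ["192.0.2.10"]

    flight = SingleFlight(fake_lookup)
    await asyncio.gather(*(flight.get("cdn.example.com") for _ in range(waiters)))
    return calls


if __name__ == "__main__":
    print("lookups issued:", asyncio.run(demo()))
```

This sketch never expires entries, so it isolates the locking behavior only; the TTL handling stays in CachedResolver.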

Wiring it into aiohttp

aiohttp accepts a custom resolver through its TCPConnector. The AsyncResolver shipped in aiohttp uses aiodns under the hood. Whether aiohttp picks it by default depends on the aiohttp and aiodns versions installed, so pass it explicitly to be sure.

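import asyncio
import aiohttp
from aiohttp.resolver import AsyncResolver

async def main() -> None:
    # Create the resolver and connector inside the running event loop.
    resolver = AsyncResolver(nameservers=["1.1.1.1", "8.8.8.8"])
    connector = aiohttp.TCPConnector(
        resolver=resolver,
        limit=200,           # total concurrent connections
        limit_per_host=8,    # concurrent connections per endpoint
        ttl_dns_cache=300,   # seconds to keep cached DNS entries
        use_dns_cache=True,
    )

    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://example.com/api/feed") as r:
            data = await r.json()

asyncio.run(main())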

Two flags matter here:

use_dns_cache=True (the default) enables aiohttp's own in-memory DNS cache.
ttl_dns_cache=300 overrides aiohttp's default 10-second TTL with something more reasonable for stable CDNs.

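If you need TTL-aware caching that respects what the authoritative server actually returns, wrap the CachedResolver above in a small aiohttp.abc.AbstractResolver subclass and pass it as TCPConnector(resolver=...). It cannot be dropped in as-is: aiohttp calls resolve(host, port, family) and expects a list of address dicts back, plus a close() coroutine.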
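
One way to bridge CachedResolver to aiohttp's resolver interface is a thin adapter. The sketch below is an assumption about how that could look, not code from our crawler: CachedResolverAdapter is a hypothetical name, it only handles IPv4 because CachedResolver only queries A records, and the import fallback exists purely so the sketch runs where aiohttp is not installed.

```python
import asyncio
import socket
from typing import Any, Awaitable, Callable

try:
    from aiohttp.abc import AbstractResolver
except ImportError:  # keeps the sketch runnable without aiohttp installed
    AbstractResolver = object  # type: ignore[assignment,misc]


class CachedResolverAdapter(AbstractResolver):
    """Hypothetical adapter exposing a host -> IPs coroutine as an aiohttp resolver."""

    def __init__(self, lookup: Callable[[str], Awaitable[list[str]]]):
        # `lookup` would be CachedResolver().resolve in the setup described above.
        self._lookup = lookup

    async def resolve(
        self, host: str, port: int = 0, family: int = socket.AF_INET
    ) -> list[dict[str, Any]]:
        ips = await self._lookup(host)
        if not ips:
            # aiohttp turns OSError from a resolver into a connection error.
            raise OSError(f"DNS lookup returned no addresses for {host}")
        # Shape of the address records aiohttp's built-in resolvers return.
        return [
            {
                "hostname": host,
                "host": ip,
                "port": port,
                "family": socket.AF_INET,  # A records only in this sketch
                "proto": 0,
                "flags": socket.AI_NUMERICHOST,
            }
            for ip in ips
        ]

    async def close(self) -> None:
        # Nothing to release here; the wrapped resolver owns its own resources.
        pass
```

Passing TCPConnector(resolver=CachedResolverAdapter(cached.resolve), use_dns_cache=False) avoids stacking aiohttp's cache on top of the TTL-aware one. aiohttp's own AsyncResolver converts aiodns.error.DNSError into OSError; a production adapter would likely want to do the same.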

Lessons we learned the hard way

Pin your nameservers. Letting containers inherit /etc/resolv.conf from the host means inconsistent behavior across regions. Hard-code 1.1.1.1 and 8.8.8.8, or point at your own recursive resolver.
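Set conservative timeouts. Depending on the version, c-ares defaults to as much as 5 seconds per attempt. For crawlers, a 2-second timeout with 2 tries (tries counts total attempts, so that is one retry) fails fast and lets you mark the host dead quickly.
Watch for IPv6 surprises. Some CDNs return AAAA records that point to addresses your egress can't actually reach. Query A and AAAA in parallel and prefer whichever responds first, or skip AAAA lookups by restricting the connector to IPv4 (family=socket.AF_INET in aiohttp's TCPConnector).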
Pre-warm before bursts. If you know which 200 hostnames a crawl batch will hit, fire off a gather() of resolutions before launching the actual fetchers.
Log resolver errors separately. A DNSError is not a ClientConnectorError. Conflating them hides nameserver problems behind generic fetch-failed metrics.
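
The pre-warm and error-separation points can be combined in one helper. This is a minimal sketch under assumptions: prewarm and max_in_flight are hypothetical names, and resolve is any coroutine that maps a hostname to a list of IPs, such as the resolve method of the CachedResolver shown earlier.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def prewarm(
    hosts: Iterable[str],
    resolve: Callable[[str], Awaitable[list[str]]],
    max_in_flight: int = 50,
) -> dict[str, Exception]:
    """Resolve each distinct host once; return the hosts that failed and why."""
    gate = asyncio.Semaphore(max_in_flight)  # cap concurrent queries to the nameserver

    async def one(host: str) -> list[str]:
        async with gate:
            return await resolve(host)

    unique = list(dict.fromkeys(hosts))  # de-duplicate while keeping order
    results = await asyncio.gather(
        *(one(h) for h in unique), return_exceptions=True
    )
    # Keep resolver failures apart from later fetch errors so they get their own metric.
    return {h: r for h, r in zip(unique, results) if isinstance(r, Exception)}
```

The returned mapping can feed a dedicated DNS-failure counter, and the fetchers can skip those hosts instead of reporting a generic connection error.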

What it bought us

After deploying this, our crawler's median time-to-first-byte dropped from 180ms to 90ms. The p99 case, where we were previously waiting on a saturated thread pool, went from 3.2s to 1.1s. CPU utilization went up, which is exactly what you want when network I/O stops being the limiting factor.

DNS is one of those layers nobody thinks about until it bites you. If your async crawler is mysteriously slow at scale, point py-spy at it for thirty seconds. If you see getaddrinfo in the top frames, aiodns is half a day of work and a permanent win.