
Harish


Your DNS check is lying to you

Or: how a "this host is dead" verdict from a single net.LookupHost call quietly broke our crawler, and what we did about it.


The setup

We run a crawler that fetches tens of thousands of corporate websites a day from a datacenter. Before we spend any budget on a fetch — the actual HTTP request, the residential proxy hop, the S3 upload — we run a cheap reachability gate. The job of the gate is one thing: answer the question "is it even worth trying to fetch this host from here?"

The first version of that gate was the obvious thing: resolve the host. If DNS returns an IP, the host exists. If it doesn't, mark the URL dead and move on.

That gate was wrong often enough to matter. This is the story of the four ways it was wrong, and the gate we ended up with.


Why "just resolve the host" isn't enough

A naive reachability check has the shape:

Call net.LookupHost. If it returns IPs, the host is reachable. If it errors, it isn't.

Every clause in that sentence is a lie in production. Here are the four leaks we hit, in order of how painful they were.

Leak 1 — CNAME chains the resolver doesn't finish in time

A lot of corporate sites don't resolve directly. They sit behind a CDN, which sits behind a tenant-specific alias, which sits behind a regional load-balancer name. From DNS's point of view, that's a CNAME chain:

ir.bigcorp.com  →  bigcorp.cdnvendor.net  →  edge-eu-west-3.cdnvendor.net  →  A 203.0.113.42

LookupHost is supposed to chase the chain transparently and hand you the final IP. It usually does. But "usually" hides two real failure modes:

  • The resolver chases the chain in series under a single deadline. A slow hop two-thirds of the way down eats the whole budget; the call returns a timeout, not the IP it would have found with another 200ms.
  • An intermediate hop misbehaves — wrong record type, NXDOMAIN at a tier the resolver doesn't expect, a stub that's been decommissioned. The lookup fails even though the host is registered and reachable through other paths.

Both look identical to the caller: LookupHost returned an error. The naive gate calls the host dead. The next day a human checks, the site loads fine in a browser, and we've burned a perfectly good URL.

Leak 2 — datacenter IPs get blocked silently

Plenty of origins explicitly drop connections from cloud IP ranges. From a residential connection they answer instantly; from our crawler's egress IP the TCP handshake just times out. There's no error code that says "I'm filtering you" — it looks exactly like a dead origin.

DNS resolves fine in that case, so the naive gate passes the host through. The fetch then burns its full timeout on a connection that was never going to land. Worse, when we do go through our residential proxy, the host fetches cleanly. The check we actually wanted to make, "is the origin down, or is it just down for us?", wasn't being made anywhere.

Leak 3 — TLS versions that our real client will refuse anyway

Our production HTTP clients pin MinVersion: TLS 1.2. Some long-tail origins still only negotiate TLS 1.0/1.1. DNS passes, the TCP handshake passes, the TLS handshake fails with a protocol-version alert, and we've spent a residential proxy request finding that out.

If we'd noticed at the gate that the server's best offer was below our floor, we could have failed the URL immediately and saved the spend.

Leak 4 — costing the proxy on requests that didn't need it

Residential proxy requests are not free. Routing every uncertain host through the proxy "just to be sure" turns a reachability check into one of the most expensive parts of the pipeline. Whatever we built had to use the proxy as a tiebreaker, not as a first resort.


What we built instead

A three-stage gate, ordered cheapest-and-most-certain first. Each stage can short-circuit the result; the proxy is only touched when the cheap stages genuinely can't tell.

Stage 1 — DNS that walks the CNAME chain by hand

The fast path is still LookupHost. It works for the vast majority of hosts, including most CNAME chains, and it costs nothing extra.

The slow path is what changes. When LookupHost fails, we don't conclude "dead host" — we conclude "the resolver couldn't finish the chain in one shot." So we walk the chain ourselves: LookupCNAME, advance one hop, try LookupHost at that level, repeat. Several things fall out of this:

  • A bounded hop count (we picked 5) protects us from CNAME loops and pathologically deep chains.
  • A visited set catches loops that don't show up as depth — a CNAME that points back to a name we've already seen.
  • A canonical-name dead-end (CNAME points to itself, or to a name with no further records) returns the original error, so genuine NXDOMAINs still surface as NXDOMAINs.
  • An intermediate hop that resolves where the full chain timed out is treated as a pass. The reasoning: the original timeout was almost certainly cumulative, not terminal. If any level in the chain has a working A record, the host is alive.

This single change moved a measurable chunk of URLs out of the "dead" bucket.
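The walk described above can be sketched roughly as follows. This is a minimal illustration and not our production code: the `Resolver` interface, `hopBudget`, `fakeResolver`, and `resolveDemo` are hypothetical names introduced here so the logic can run without touching real DNS. One caveat worth stating: Go's `LookupCNAME` returns a canonical name and may itself follow more than one alias, so "one hop" in this sketch means "whatever name the resolver hands back next".

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// Resolver is the subset of *net.Resolver the walk needs, so a fake can stand in.
type Resolver interface {
	LookupHost(ctx context.Context, host string) ([]string, error)
	LookupCNAME(ctx context.Context, host string) (string, error)
}

const (
	maxHops   = 5               // hop bound mentioned in the article
	hopBudget = 2 * time.Second // hypothetical per-lookup deadline, not from the article
)

// normalize lowercases and strips the trailing dot so visited-set checks compare like with like.
func normalize(name string) string {
	return strings.ToLower(strings.TrimSuffix(name, "."))
}

// lookupHost gives every address lookup its own deadline instead of one shared budget.
func lookupHost(ctx context.Context, r Resolver, host string) ([]string, error) {
	hopCtx, cancel := context.WithTimeout(ctx, hopBudget)
	defer cancel()
	return r.LookupHost(hopCtx, host)
}

// ResolveHost tries the fast path first, then walks the alias chain by hand.
// On any dead end it returns the ORIGINAL error, so a real NXDOMAIN stays an NXDOMAIN.
func ResolveHost(ctx context.Context, r Resolver, host string) ([]string, error) {
	addrs, origErr := lookupHost(ctx, r, host)
	if origErr == nil {
		return addrs, nil
	}
	visited := map[string]bool{normalize(host): true}
	current := host
	for hop := 0; hop < maxHops; hop++ {
		hopCtx, cancel := context.WithTimeout(ctx, hopBudget)
		cname, err := r.LookupCNAME(hopCtx, current)
		cancel()
		if err != nil {
			return nil, origErr // no further records at this level
		}
		next := normalize(cname)
		if next == "" || visited[next] {
			return nil, origErr // self-reference or loop
		}
		visited[next] = true
		if addrs, err := lookupHost(ctx, r, next); err == nil {
			return addrs, nil // an intermediate level resolved: treat the host as alive
		}
		current = next
	}
	return nil, origErr // chain deeper than maxHops
}

// fakeResolver is an in-memory stand-in that never chases aliases on its own.
type fakeResolver struct {
	hosts  map[string][]string
	cnames map[string]string
}

func (f fakeResolver) LookupHost(_ context.Context, host string) ([]string, error) {
	if addrs, ok := f.hosts[normalize(host)]; ok {
		return addrs, nil
	}
	return nil, &net.DNSError{Err: "no such host", Name: host, IsNotFound: true}
}

func (f fakeResolver) LookupCNAME(_ context.Context, host string) (string, error) {
	if target, ok := f.cnames[normalize(host)]; ok {
		return target, nil
	}
	return "", &net.DNSError{Err: "no such host", Name: host, IsNotFound: true}
}

var demo = fakeResolver{
	hosts: map[string][]string{"edge-eu-west-3.cdnvendor.net": {"203.0.113.42"}},
	cnames: map[string]string{
		"ir.bigcorp.com":        "bigcorp.cdnvendor.net.",
		"bigcorp.cdnvendor.net": "edge-eu-west-3.cdnvendor.net.",
		"loop-a.example":        "loop-b.example.",
		"loop-b.example":        "loop-a.example.",
	},
}

func resolveDemo(host string) ([]string, error) {
	return ResolveHost(context.Background(), demo, host)
}

func main() {
	for _, h := range []string{"ir.bigcorp.com", "loop-a.example", "gone.example"} {
		addrs, err := resolveDemo(h)
		fmt.Println(h, addrs, err)
	}
}
```

Against real DNS, the same function should accept `net.DefaultResolver`, since `*net.Resolver` has both methods with these signatures.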

Stage 2 — TLS as a three-way verdict, not a boolean

A direct TLS dial from our datacenter is doing two jobs at once. It's checking whether the server speaks a version we accept, and it's checking whether the server answers us at all. Those two outcomes need different handling, so we classify the dial as one of three things:

  • Usable — handshake completed, negotiated version ≥ TLS 1.2. Pass. The fetch will work.
  • Rejected — the server sent a TLS alert, or negotiated below our floor. Fail fast, and importantly: do not waste a proxy probe on this. Our real client would be rejected the same way; the proxy can't fix a TLS-version mismatch.
  • Inconclusive — a bare network error (timeout, connection refused, reset). From a datacenter IP this could mean a dead origin, or an origin that filters cloud ranges. We don't know yet, so we defer to stage 3.

The trick that makes stage 2 work is dialing with a permissive MinVersion (TLS 1.0). We want to see what the server can do, then enforce our floor ourselves — otherwise the handshake fails the version check before we get to see what was actually negotiated, and "version too old" becomes indistinguishable from "didn't answer."

Distinguishing a TLS alert from a network error needs a little care: timeouts are network, tls.AlertError is a server-sent alert, anything else carrying a tls: marker (record-header mismatch, plaintext where TLS was expected, protocol-version errors) is a TLS-layer rejection.
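A rough sketch of that classification, under a few assumptions: `Verdict`, `ClassifyTLS`, `ProbeTLS`, and `demoErr` are names made up for this example, and `tls.AlertError` needs Go 1.21 or later. As far as I can tell, on an ordinary `crypto/tls` connection a received alert tends to surface as a `*net.OpError` with `Op` set to "remote error" rather than as a `tls.AlertError`, so the sketch checks both and then falls back to the `tls:` marker.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

type Verdict int

const (
	Usable Verdict = iota
	Rejected
	Inconclusive
)

func (v Verdict) String() string {
	return [...]string{"usable", "rejected", "inconclusive"}[v]
}

// floor is the minimum version the real client accepts.
const floor = tls.VersionTLS12

// ClassifyTLS maps a handshake outcome to a verdict.
// negotiated is only meaningful when err is nil.
func ClassifyTLS(negotiated uint16, err error) Verdict {
	if err == nil {
		if negotiated >= floor {
			return Usable
		}
		return Rejected // handshake worked, but below our floor
	}
	// Timeouts are checked first: they are network-level no matter what wraps them.
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return Inconclusive
	}
	var alert tls.AlertError
	if errors.As(err, &alert) {
		return Rejected
	}
	// Assumption: received alerts arrive as a *net.OpError with Op "remote error".
	var opErr *net.OpError
	if errors.As(err, &opErr) && opErr.Op == "remote error" {
		return Rejected
	}
	// Anything else carrying the tls: marker is treated as a TLS-layer rejection.
	if strings.Contains(err.Error(), "tls:") {
		return Rejected
	}
	return Inconclusive // refused, reset, unreachable: cannot tell from here
}

// ProbeTLS dials with a permissive MinVersion so the negotiated version is visible,
// then enforces the floor in ClassifyTLS.
func ProbeTLS(host string, timeout time.Duration) Verdict {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", net.JoinHostPort(host, "443"), &tls.Config{
		MinVersion: tls.VersionTLS10,
		ServerName: host,
	})
	if err != nil {
		return ClassifyTLS(0, err)
	}
	defer conn.Close()
	return ClassifyTLS(conn.ConnectionState().Version, nil)
}

// demoErr builds synthetic errors shaped like what a dial might return.
func demoErr(kind string) error {
	switch kind {
	case "timeout":
		return &net.OpError{Op: "dial", Net: "tcp", Err: os.ErrDeadlineExceeded}
	case "refused":
		return &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
	case "alert":
		return &net.OpError{Op: "remote error", Err: tls.AlertError(70)} // 70 = protocol_version
	default:
		return errors.New("tls: first record does not look like a TLS handshake")
	}
}

func main() {
	fmt.Println("TLS 1.3 handshake:", ClassifyTLS(tls.VersionTLS13, nil))
	fmt.Println("TLS 1.0 handshake:", ClassifyTLS(tls.VersionTLS10, nil))
	for _, kind := range []string{"timeout", "refused", "alert", "record"} {
		fmt.Println(kind+":", ClassifyTLS(0, demoErr(kind)))
	}
}
```

Keeping the classifier separate from the dial means the decision table can be exercised with synthetic errors, without a network.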

Stage 3 — residential proxy as a tiebreaker

Only the inconclusive case reaches here. The question we're answering is "is the origin actually dead, or is it just dead for our datacenter IP?", and the only way to answer it is to ask from a non-datacenter vantage point.

We send a single GET through the residential proxy with a generous budget (~20s — a residential hop is much slower than a local dial, and a tight budget would itself produce false negatives). Reachability semantics, not success semantics: any HTTP response — 200, 301, 403, even 404 — proves the origin is up and answering, so it passes the gate. The gate fails only when the request never reaches a responding origin: a transport error, or a proxy-upstream 5xx (502/503/504, the way our proxy signals it couldn't reach the upstream).

One retry with a small randomized backoff absorbs transient edge failures at the proxy without ballooning cost. If the proxy is unconfigured or its URL is malformed, the gate refuses to declare a host dead on the strength of our own broken config — it returns "cannot confirm" and lets the host through.
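Here is one way the proxy stage could look, as a sketch only. `ProxyVerdict`, `originAnswered`, `newProxyClient`, and `ProbeViaProxy` are hypothetical names, and the backoff range (200 to 500 ms) is an illustrative guess, not a number from our system. Redirects are deliberately not followed, since a 301 already proves the origin answered.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"net/url"
	"time"
)

type ProxyVerdict int

const (
	Reachable     ProxyVerdict = iota // some HTTP response came back from the origin
	Unreachable                       // transport error or proxy-upstream 5xx on every attempt
	CannotConfirm                     // our own proxy config is broken: fail open
)

// originAnswered applies reachability semantics: any status passes except the
// 502/503/504 family the proxy uses to say it could not reach the upstream.
func originAnswered(status int) bool {
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return false
	}
	return true
}

// newProxyClient refuses to build a client from a missing or malformed proxy URL.
func newProxyClient(rawProxyURL string, budget time.Duration) (*http.Client, error) {
	if rawProxyURL == "" {
		return nil, errors.New("proxy URL not configured")
	}
	u, err := url.Parse(rawProxyURL)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return nil, fmt.Errorf("malformed proxy URL %q", rawProxyURL)
	}
	return &http.Client{
		Timeout:   budget,
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		// Return the first response instead of following redirects.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}, nil
}

// ProbeViaProxy sends one GET plus one retry with jitter through the proxy.
func ProbeViaProxy(rawProxyURL, host string) ProxyVerdict {
	client, err := newProxyClient(rawProxyURL, 20*time.Second)
	if err != nil {
		return CannotConfirm // never call a host dead because of our own config
	}
	for attempt := 0; attempt < 2; attempt++ {
		if attempt > 0 {
			// Small randomized backoff before the single retry (illustrative values).
			time.Sleep(time.Duration(200+rand.Intn(300)) * time.Millisecond)
		}
		resp, err := client.Get("https://" + host + "/")
		if err != nil {
			continue // transport error: retry once, then give up
		}
		resp.Body.Close()
		if originAnswered(resp.StatusCode) {
			return Reachable
		}
	}
	return Unreachable
}

func main() {
	for _, status := range []int{200, 301, 403, 404, 502, 503, 504} {
		fmt.Println(status, "origin answered:", originAnswered(status))
	}
	// No network traffic here: an empty proxy URL short-circuits before any request.
	fmt.Println("unconfigured proxy is CannotConfirm:", ProbeViaProxy("", "example.com") == CannotConfirm)
}
```

The caller is expected to treat `CannotConfirm` the same as a pass, which is the fail-open behaviour described above.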


The flow

                                ┌─────────────────────────────┐
                                │     ResolveHost(host)       │
                                └──────────────┬──────────────┘
                                               │
                                               ▼
                            ┌──────────────────────────────────┐
                            │   Stage 1: DNS                   │
                            │   LookupHost(host)               │
                            └──────────────┬───────────────────┘
                                           │
                       ┌───────────────────┴────────────────────┐
                       │                                        │
                       ▼ ok                                     ▼ error
              (continue to stage 2)                ┌────────────────────────┐
                       │                           │  Walk CNAME chain      │
                       │                           │  hop-by-hop, max 5     │
                       │                           │  - dedupe via visited  │
                       │                           │  - retry LookupHost    │
                       │                           │    at each hop         │
                       │                           └────────────┬───────────┘
                       │                                        │
                       │              ┌─────────────────────────┴──────────────┐
                       │              │ any hop resolves    chain dead / loop  │
                       │              ▼                     ▼                  │
                       │      (continue to stage 2)   FAIL (real NXDOMAIN /    │
                       │                              chain exhausted)         │
                       │                                                       │
                       ▼                                                       │
            ┌──────────────────────────────────────────┐                       │
            │ Stage 2: TLS probe                       │                       │
            │   tls.Dial host:443                      │                       │
            │   MinVersion = TLS 1.0 (permissive)      │                       │
            │   classify the outcome                   │                       │
            └─────────────┬────────────────────────────┘                       │
                          │                                                    │
       ┌──────────────────┼─────────────────────────┐                          │
       ▼ usable           ▼ rejected                ▼ inconclusive             │
  handshake ok,     TLS alert, or                 bare net error               │
  negotiated ≥ 1.2  negotiated < 1.2              (timeout/refused/reset)      │
       │                  │                            │                       │
       ▼                  ▼                            ▼                       │
     PASS              FAIL (fast,             ┌─────────────────────────┐     │
                       don't touch proxy)      │ Stage 3: proxy probe    │     │
                                               │   GET https://host/     │     │
                                               │   via residential proxy │     │
                                               │   retry once + jitter   │     │
                                               └─────────────┬───────────┘     │
                                                             │                 │
                                              ┌──────────────┴───────────┐     │
                                              ▼ origin answered          ▼     │
                                              (any HTTP status)     transport  │
                                              PASS                  err / 5xx  │
                                                                     FAIL      │
                                                                               │
                                                                               │
                          ◄────────────────────────────────────────────────────┘

What we learned along the way

A few generalisable things fell out of building this.

A reachability gate has to model where it's running from. A check that's perfectly accurate from a laptop is wrong half the time from a datacenter. The vantage point is part of the system.

Failure has more states than success does. "Dead" is at least three different things — DNS doesn't resolve, TLS won't handshake, network won't connect — and conflating them means you can't act on them differently. The three-way TLS outcome was the single biggest fix in this whole thing.

Order checks by cost, and let cheap checks short-circuit expensive ones. Stage 1 catches most dead hosts. Stage 2 catches version-incompatible hosts before we burn proxy budget. Stage 3 only runs when the first two genuinely couldn't tell.
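To make the short-circuiting concrete, here is a small sketch of the ordering with each stage injected as a function. The `Stages` struct, `Gate`, and the stage labels are invented for this example; the real stages would be the DNS walk, the TLS probe, and the proxy probe.

```go
package main

import "fmt"

type Outcome int

const (
	Pass Outcome = iota
	Fail
)

type TLSVerdict int

const (
	Usable TLSVerdict = iota
	Rejected
	Inconclusive
)

// Stages holds the three checks in cost order. Injecting them as functions
// keeps the ordering logic separate from the network code.
type Stages struct {
	ResolveDNS func(host string) bool                   // cheapest: does any level resolve?
	ProbeTLS   func(host string) TLSVerdict             // direct dial from the datacenter
	ProbeProxy func(host string) (reachable, confirmed bool) // most expensive: residential hop
}

// Gate runs the stages cheapest first and returns the outcome plus the stage that decided it.
func Gate(host string, s Stages) (Outcome, string) {
	if !s.ResolveDNS(host) {
		return Fail, "dns"
	}
	switch s.ProbeTLS(host) {
	case Usable:
		return Pass, "tls"
	case Rejected:
		return Fail, "tls" // the proxy cannot fix a version mismatch, so skip it
	}
	// Only the inconclusive case pays for the proxy.
	reachable, confirmed := s.ProbeProxy(host)
	if !confirmed {
		return Pass, "proxy-unconfirmed" // broken config on our side: fail open
	}
	if reachable {
		return Pass, "proxy"
	}
	return Fail, "proxy"
}

func main() {
	proxyCalls := 0
	s := Stages{
		ResolveDNS: func(string) bool { return true },
		ProbeTLS:   func(string) TLSVerdict { return Rejected },
		ProbeProxy: func(string) (bool, bool) { proxyCalls++; return true, true },
	}
	outcome, stage := Gate("old-tls.example", s)
	fmt.Println("failed:", outcome == Fail, "decided by:", stage, "proxy calls:", proxyCalls)

	s.ProbeTLS = func(string) TLSVerdict { return Inconclusive }
	outcome, stage = Gate("filtered.example", s)
	fmt.Println("passed:", outcome == Pass, "decided by:", stage, "proxy calls:", proxyCalls)
}
```

Returning the deciding stage alongside the outcome is a design choice of the sketch: it keeps the different kinds of "dead" distinguishable downstream.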

Trust your own outputs more than the resolver's outputs. Walking the CNAME chain by hand felt like working around the standard library, but the standard library is doing one thing (give the caller an IP) and we needed another (tell the caller whether the host exists at all). They're not the same question.

Configuration failures must not look like host failures. A missing proxy URL is our problem, not the host's. The gate has to fail open on its own broken config — otherwise a config regression silently nukes thousands of perfectly good URLs.

If you're building anything that touches the long tail of the public web from a datacenter, your "is this host alive" check is probably hiding two or three of these leaks. It's worth a look.
