Axel
TCP worked, UDP silently died: shipping a DNS proxy to fly.io

I shipped v0.2.0 of dnsink — a Rust DNS proxy with threat-intelligence feeds and DNS tunneling detection — to fly.io this week. TCP worked on the first deploy. UDP silently dropped every reply. This is the debug.

What dnsink does

dnsink is a small Rust daemon that sits between a DNS client and its upstream resolver. It checks every query against live threat-intel feeds (URLhaus, OpenPhish, PhishTank) and blocks matches at the DNS layer, returning NXDOMAIN. Clean queries get forwarded, optionally over DoH. The whole lookup path — bloom filter pre-screen + radix trie confirm — takes ~288 ns on a miss.
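The two-stage lookup can be sketched roughly like this — a minimal, illustrative version with a hand-rolled Bloom filter for the pre-screen and a `HashSet` standing in for the radix trie confirm step. None of these names are dnsink's actual internals; this just shows why a miss is cheap: if any Bloom bit is unset, the domain is definitely clean and the confirm structure is never touched.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Illustrative blocklist: Bloom pre-screen + exact confirm.
/// (Hypothetical names; dnsink's real confirm step is a radix trie.)
struct Blocklist {
    bits: Vec<bool>,       // Bloom filter bit array
    confirm: HashSet<String>, // exact-match confirm set
}

impl Blocklist {
    fn new(domains: &[&str]) -> Self {
        let mut bl = Blocklist {
            bits: vec![false; 1 << 16],
            confirm: HashSet::new(),
        };
        for d in domains {
            // Set k = 3 Bloom bits per domain.
            for seed in 0..3u64 {
                bl.bits[Self::index(d, seed)] = true;
            }
            bl.confirm.insert(d.to_string());
        }
        bl
    }

    /// One of the k Bloom positions for a domain.
    fn index(domain: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        domain.hash(&mut h);
        (h.finish() as usize) & 0xFFFF // keep within the 2^16 bit array
    }

    /// Fast path: any unset Bloom bit means "definitely not blocked",
    /// so the confirm lookup only runs on possible hits.
    fn is_blocked(&self, domain: &str) -> bool {
        let maybe = (0..3u64).all(|s| self.bits[Self::index(domain, s)]);
        maybe && self.confirm.contains(domain)
    }
}
```

The Bloom filter can false-positive, but never false-negative, which is exactly what you want in front of an exact-match structure: clean traffic (the overwhelming majority) usually exits after a handful of bit tests.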

For a portfolio deploy I wanted a live public endpoint on fly.io. The image was on ghcr.io, multi-arch, distroless-nonroot. fly.toml was straightforward — UDP + TCP DNS on port 53, Prometheus metrics on 9090. flyctl deploy ran clean. Two machines came up, health checks passed.
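The fly.toml would have looked something along these lines — the section and field names follow fly.io's documented schema, but treat the exact values here as assumptions reconstructed from the description above, not the file from the repo:

```toml
# Hypothetical fly.toml sketch — ports/values are assumptions.
app = "dnsink"

[[services]]
  protocol = "udp"
  internal_port = 5353   # dnsink's listen port inside the container

  [[services.ports]]
    port = 53            # public DNS

[[services]]
  protocol = "tcp"
  internal_port = 5353

  [[services.ports]]
    port = 53

[[services]]
  protocol = "tcp"
  internal_port = 9090   # Prometheus metrics

  [[services.ports]]
    port = 9090
    handlers = ["tls", "http"]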

The symptom

$ dig @dnsink.fly.dev example.com +tcp
; <<>> DiG 9.18.39 <<>> @dnsink.fly.dev example.com +tcp
;; Got answer:
;; ANSWER SECTION:
example.com. 292 IN A 104.20.23.154

$ dig @dnsink.fly.dev example.com
;; communications error: timed out
;; no servers could be reached

TCP: clean resolution. UDP: timeout.

Same port, same server, same address family. The only difference was the socket type — SOCK_STREAM versus SOCK_DGRAM.

Diagnosis

First instinct: the container isn't listening on UDP. The logs said otherwise:

INFO dnsink::proxy: listening on 0.0.0.0:5353 (UDP + TCP)

Next: maybe fly isn't routing UDP to the machine. If UDP packets never arrive, the dnsink_queries_total counter stays flat. After several dig attempts, the counter was still at 0. But that alone was ambiguous — either packets weren't arriving, or they were arriving and the response path was broken somewhere the counter couldn't see. The counter increments on receive, not on successful reply, so it says nothing about whether replies make it back out.

I re-ran after adding a log line on UDP receive. Packets WERE arriving. The machine saw the queries. The replies just never made it back to me.
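The diagnostic log line amounts to logging at the earliest possible point in the receive path, before any reply is attempted, so receive-path and reply-path failures can be told apart. A minimal sketch (dnsink itself uses `tracing`; `eprintln!` keeps this dependency-free, and the function name is mine, not dnsink's):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Log every datagram the moment it arrives — BEFORE parsing,
/// filtering, or replying — so a silent reply-path failure can't
/// masquerade as a receive-path failure.
fn recv_logged(sock: &UdpSocket, buf: &mut [u8]) -> io::Result<(usize, SocketAddr)> {
    let (n, peer) = sock.recv_from(buf)?;
    // This line fires even if the reply is later dropped downstream.
    eprintln!("udp recv: {n} bytes from {peer}");
    Ok((n, peer))
}
```

With that in place, the logs answered the question directly: queries in, no replies out.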

The actual bug

fly.io's UDP docs have one critical line that didn't match the symptom I was searching for:

To receive UDP packets, your app needs to bind to the special fly-global-services address. Standard addresses like 0.0.0.0, *, or INADDR_ANY won't work properly because Linux will use the wrong source address in replies if you use those.

The word "receive" in the first sentence is misleading: packets DO arrive on a 0.0.0.0 bind. The problem is the REPLY — when Linux responds to a UDP datagram from a wildcard-bound socket, it picks a source IP based on the interface the reply goes out on. On fly, that means the kernel stamps the machine's private fly-internal IP as the source. fly-proxy sees the reply, checks the source IP, doesn't recognize it as belonging to the publicly routed IPv4, and drops it.

Binding to fly-global-services tells the kernel: "use this specific address as the source on replies." fly's infra then sees the right source IP and forwards correctly.
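The underlying socket behavior is easy to observe locally: a socket bound to a specific address uses that address as the source on everything it sends, while a wildcard-bound socket leaves the choice to the kernel, per outgoing route. A small sketch of the distinction:

```rust
use std::io;
use std::net::UdpSocket;

/// A UDP socket bound to a specific local address stamps that address
/// as the source on every datagram it sends. A wildcard bind defers
/// the choice to the kernel — which, on fly, picks the wrong one.
fn reply_source_demo() -> io::Result<()> {
    let specific = UdpSocket::bind("127.0.0.1:0")?;
    // local_addr() is the source every send from this socket will carry.
    assert_eq!(specific.local_addr()?.ip().to_string(), "127.0.0.1");

    let wildcard = UdpSocket::bind("0.0.0.0:0")?;
    // Unspecified: the kernel selects a source per outgoing interface.
    assert!(wildcard.local_addr()?.ip().is_unspecified());
    Ok(())
}
```

Binding to `fly-global-services` is the first pattern: it pins the reply source to the one address fly's edge will accept.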

This doesn't bite TCP because TCP is connection-oriented and fly-proxy holds the connection state at the edge — it knows where to route the response regardless of source-IP inconsistencies. UDP has no such state. Each reply datagram has to route itself on its own header.

The fix (first attempt, broken)

I changed the bind address in my baked config:

[listen]
address = "fly-global-services"
port = 5353

Redeployed. dig @dnsink.fly.dev example.com now worked over UDP.

And broke TCP.

$ dig @dnsink.fly.dev example.com +tcp
;; communications error: end of file

TCP on fly goes through fly-proxy, which connects to the container via the container's published ports. If the listener binds to a specific local IP (fly-global-services resolves to the fly-internal IPv4), fly-proxy's connection from the public ingress interface doesn't land on that listener — it's bound to a different interface than the one fly-proxy is dialing in on.

UDP needs fly-global-services. TCP needs a wildcard bind. They conflict.

The fix (real)

The clean solution is asymmetric binds — UDP to fly-global-services, TCP to wildcard. I added an optional tcp_address override to dnsink's listen config:

pub struct ListenConfig {
    pub address: String,
    pub port: u16,
    // Optional TCP-specific bind. When None, TCP uses `address`.
    #[serde(default)]
    pub tcp_address: Option<String>,
}

And in the proxy:

let port = self.config.listen.port;
let udp_addr = format!("{}:{}", self.config.listen.address, port);
let tcp_addr = match &self.config.listen.tcp_address {
    Some(a) => format!("{a}:{port}"),
    None => udp_addr.clone(),
};

let udp_socket = UdpSocket::bind(&udp_addr).await?;
let tcp_listener = TcpListener::bind(&tcp_addr).await?;

The config.docker.toml shipped with the image now specifies:

[listen]
address = "fly-global-services"
tcp_address = "[::]"
port = 5353

Local Docker runs (without fly-global-services resolvable) override the config via bind-mount to use 0.0.0.0 or [::] for both. The default for direct cargo run stays at 127.0.0.1, unaware of any of this.
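That bind-mounted local override might look something like this (a sketch, not the repo's actual file):

```toml
# Hypothetical local override, mounted over the baked config.
# fly-global-services doesn't resolve outside fly, so both
# protocols fall back to a wildcard bind.
[listen]
address = "0.0.0.0"
port = 5353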

Deployed. TCP works. UDP works. Metrics work.

$ dig @dnsink.fly.dev example.com +tcp
;; ANSWER SECTION:
example.com. 300 IN A 104.20.23.154

$ dig @dnsink.fly.dev -p 5353 example.com
;; ANSWER SECTION:
example.com. 300 IN A 104.20.23.154

$ curl https://dnsink.fly.dev/metrics
dnsink_queries_total 42
dnsink_queries_allowed_total 42

Takeaways

Exposing platform quirks in config is the honest option. The listen.tcp_address override isn't elegant — a clean config shouldn't have to surface host-platform specifics. But the alternative is hard-coding fly.io detection into dnsink, which is worse. An optional config field that users only set when deploying to fly is a minimal tax.

Docs are often correct but don't match your symptom. fly's docs say what you need to do — they don't describe what FAILURE looks like when you don't. "Linux uses the wrong source address" is easy to read past when you're debugging a timeout. Finding this required building a mental model of how UDP replies route, not just reading the bind guidance.

Reply-path bugs look like receive-path bugs. I spent time validating that my UDP listener was actually listening before realizing the issue was outbound, not inbound. Adding the receive-path log line first would have shortened the debug by 20 minutes.

TCP and UDP aren't interchangeable on serverless platforms. Same port, same listener, different routing path. fly's proxy handles TCP differently from UDP. AWS ELB, GCP Load Balancer, Kubernetes kube-proxy — all have similar asymmetries. If you ship something UDP-heavy, deploy early, break things, and factor the quirks into your config surface.

v0.2.0 is live at github.com/kakarot-dev/dnsink. Docker image at ghcr.io/kakarot-dev/dnsink:v0.2.0 (multi-arch, amd64 + arm64, distroless/cc-debian12:nonroot). The fly.toml and config.docker.toml in the repo are the reference for deploying yourself.
