Sankalp Gilda

Posted on May 4

A superscript-1 walks past every Go SSRF guard

#security #go #codeql #semgrep

TL;DR. golang.org/x/net/idna.Lookup.ToASCII runs UTS-46 NFKC mapping
on hostnames, which folds 100 non-ASCII Unicode digit codepoints (math
superscripts, circled digits, fullwidth digits, math-styled digits, and
others) to ASCII 0-9. A pre-IDNA net.ParseIP check rejects the
non-ASCII input as not-an-IP, hands it to the library, and gets back a
real IPv4 literal. That literal then walks past SSRF allowlists,
NO_PROXY lists, TLS-SNI routers, and cookie-domain validators that
only checked the pre-IDNA value. The fix is a post-IDNA TrimRight + ParseAddr
recheck. This post covers the bug, a runnable proof of concept against
golang.org/x/net/http/httpproxy, the canonical safe pattern, and two
just-shipped registry rules (CodeQL + Semgrep) that catch it in CI.

I ran into this one while writing a Go HTTP client for a private project. I
had a host allowlist, I had idna.Lookup.ToASCII canonicalising the host
before dial, and I still could not convince myself the allowlist held. It
did not. A single superscript "1" (U+00B9) in the host walked straight
through.

The shape is general. Any Go program that calls golang.org/x/net/idna.Lookup.ToASCII
(or the MapForLookup profile, or any custom profile built on
idna.New(idna.MapForLookup(), ...)) on attacker-controlled hostnames is a
candidate. The library does what its specification says it does. The caller
does what its tutorial says it does. Between them, a smuggled IPv4 literal
slips past every SSRF allowlist, every NoProxy rule, and every TLS-SNI
router, and reaches a network sink as if it were a regular hostname.

I reported it privately. The Go security team declined to treat it as a
library bug, on the grounds that the post-IDNA IP-literal check is the
caller's responsibility. The bug class is real either way, and "technically
the caller's fault" is cold comfort for a working engineer staring at an
SSRF in production. So I went looking for what caller-side tooling actually
helps. This post is the writeup: the mechanism, a concrete proof of concept,
the defensive pattern, and the detection rules I shipped at v0.1.1 along
with the empirical work that drove the design.

The mechanism

IDNA stands for Internationalized Domain Names in Applications, and
UTS-46 (Unicode IDNA Compatibility Processing) defines a mapping step that
normalizes Unicode hostnames before they are converted to ASCII for the
wire. The mapping builds on NFKC compatibility normalization. If you have
not stared at the Unicode tables recently, NFKC has a property that matters
here: it folds compatibility digit codepoints to their ASCII counterparts.
A circled ① becomes 1. A fullwidth ０ becomes 0. A superscript ¹ becomes 1.
A double-struck 𝟙 becomes 1. And so on, across a hundred different
codepoints in eight families: Latin-1
superscripts, mathematical superscripts, mathematical subscripts, circled
digits, fullwidth digits, mathematical bold and sans-serif and double-struck
and monospace digits, and the segmented digits in the Symbols for Legacy
Computing block.
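To make the fold concrete, here is a minimal sketch that hand-maps the digit families listed above onto ASCII. The function names foldRune and foldHost are hypothetical, and the table is written out by hand for illustration; it is not the table golang.org/x/net/idna uses internally, and it skips everything else UTS-46 does (case folding, validation, Punycode).

```go
package main

import (
	"fmt"
	"strings"
)

// foldRune is a hand-written illustration of the compatibility digit folds.
// It is an assumption-laden stand-in, not the library's mapping table.
func foldRune(r rune) rune {
	switch {
	case r == 0x00B9: // superscript one (Latin-1 Supplement)
		return '1'
	case r == 0x00B2 || r == 0x00B3: // superscript two and three (Latin-1 Supplement)
		return '0' + (r - 0x00B0)
	case r == 0x2070: // superscript zero
		return '0'
	case r >= 0x2074 && r <= 0x2079: // superscript four to nine
		return '0' + (r - 0x2070)
	case r >= 0x2080 && r <= 0x2089: // subscript zero to nine
		return '0' + (r - 0x2080)
	case r == 0x24EA: // circled zero
		return '0'
	case r >= 0x2460 && r <= 0x2468: // circled one to nine
		return '1' + (r - 0x2460)
	case r >= 0xFF10 && r <= 0xFF19: // fullwidth zero to nine
		return '0' + (r - 0xFF10)
	case r >= 0x1D7CE && r <= 0x1D7FF: // five math-styled runs of ten digits each
		return '0' + (r-0x1D7CE)%10
	case r >= 0x1FBF0 && r <= 0x1FBF9: // segmented digits
		return '0' + (r - 0x1FBF0)
	case r == 0xFF0E || r == 0x3002: // fullwidth and ideographic full stop
		return '.'
	}
	return r
}

// foldHost applies foldRune to every rune of host.
func foldHost(host string) string {
	return strings.Map(foldRune, host)
}

func main() {
	inputs := []string{
		"0.\u00b9.0.0", // superscript one in the second label
		"\uff11\uff19\uff12\uff0e\uff11\uff16\uff18\uff0e\uff11\uff0e\uff11", // fullwidth digits and dots
		"\u2460\u2461\u2466.0.0.\u2460",                                      // circled digits
	}
	for _, in := range inputs {
		// %+q prints the input with ASCII-only escapes so the codepoints are visible.
		fmt.Printf("%+q -> %s\n", in, foldHost(in))
	}
}
```

Every input above contains at least one non-ASCII rune, yet every folded result is a plain dotted quad.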

Now consider a host string like 0.¹.0.0. The 0 and . are ASCII; the
¹ is U+00B9, SUPERSCRIPT ONE from the Latin-1 Supplement block.
net.ParseIP rejects the string because it is not an ASCII dotted quad, and
idna.Lookup.ToASCII then maps it to 0.1.0.0, a valid IPv4 literal. The bug
class is the absence of a post-IDNA IP-literal recheck. The shape that
matters is not "caller did a pre-IDNA net.ParseIP then trusted the
post-IDNA result"; the shape that matters is "the post-IDNA value flowed to
a network sink without any IP-literal check at the post-mapping point." A
pre-IDNA check, when present, makes the bug worse by giving reviewers a
false sense of input validation, but its absence is not what creates the
smuggle. The absence of a post-IDNA check is.

The most common real-world version of this shape lives in
golang.org/x/net/http/httpproxy.canonicalAddr. There is no pre-IDNA
guard at all. The function takes a *url.URL, calls a small wrapper that
runs idna.Lookup.ToASCII on the host, and feeds the result straight to
net.JoinHostPort. The post-IDNA value is the host that decides whether
the request goes to the configured proxy or to the origin directly.
Anything that NFKC-folds to a numeric literal walks past NoProxy and
out to wherever the smuggled literal points.

The Go module documentation does not mention this. The (*Profile).ToASCII
godoc says it converts the input to ASCII according to the rules of UTS-46.
It does. There is no IP-literal detection in the idna package because
detecting IP literals is not what UTS-46 specifies. Any caller that needs
to reject post-mapping IP literals has to do that work themselves.

A two-line proof of concept

Here is a self-contained Go program that exhibits the bug against
golang.org/x/net/http/httpproxy. The httpproxy package canonicalises the
request URL host before consulting the operator's NO_PROXY list. It does
that canonicalisation through idna.Lookup.ToASCII, and it does no
post-mapping IP-literal recheck.

package main

import (
    "fmt"
    "net/url"

    "golang.org/x/net/http/httpproxy"
)

func main() {
    cfg := &httpproxy.Config{
        HTTPSProxy: "https://corporate-mitm-proxy.internal:8080",
        HTTPProxy:  "http://corporate-mitm-proxy.internal:8080",
        NoProxy:    "0.1.0.0",
    }
    proxyFunc := cfg.ProxyFunc()

    cases := []string{
        "https://example.com/api",
        "https://0.1.0.0/api",
        "https://0.¹.0.0/api",
        "https://１９２．１６８．１．１/api",
    }
    for _, raw := range cases {
        u, _ := url.Parse(raw)
        proxy, _ := proxyFunc(u)
        fmt.Printf("[%s] proxy=%v\n", raw, proxy)
    }
}

Run that against golang.org/x/net@v0.53.0. The first URL routes through
the configured corporate proxy. The second URL matches NO_PROXY=0.1.0.0
and bypasses the proxy directly to the literal 0.1.0.0. The third URL,
the smuggled one, also bypasses the proxy: 0.¹.0.0 canonicalises to
0.1.0.0, which matches the NO_PROXY entry. The fourth URL, fullwidth
192.168.1.1, behaves the same way against any allowlist that lists
192.168.1.1 or any RFC 1918 range.

The exploit shape varies by caller. For httpproxy, it is an
egress-monitoring or DLP bypass: the attacker takes the unmonitored path.
For an HTTP client that calls idna.Lookup.ToASCII and then dials, it is a classic
SSRF: the attacker reaches loopback, RFC 1918, link-local, or cloud
metadata endpoints by smuggling those literals through a guard that only
checks ASCII IPs. The classic worked example is the AWS IMDS endpoint at
169.254.169.254: an attacker who can route a fullwidth １６９.２５４.１６９.２５４
or a math-styled 𝟣𝟨𝟫.𝟤𝟧𝟦.𝟣𝟨𝟫.𝟤𝟧𝟦 past your guards reaches the
instance-credential surface from your own service. For a TLS-SNI-based
router, it is a routing-table bypass.
For a cookie-domain validator, it is a cookie-scope confusion that hands
the attacker cookies issued for a numeric host.
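As a rough illustration of what is at stake once the literal is through, the following sketch classifies a post-IDNA host string by the address range it lands in. The function name classify and its labels are hypothetical; the input is assumed to be the value already returned by idna.Lookup.ToASCII.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// classify takes a host string as it looks AFTER IDNA mapping and reports
// which sensitive range, if any, it would reach as an IP literal.
func classify(ace string) string {
	addr, err := netip.ParseAddr(strings.TrimRight(ace, "."))
	if err != nil {
		return "not an IP literal"
	}
	switch {
	case addr.IsLoopback():
		return "loopback"
	case addr.IsPrivate():
		return "private"
	case addr.IsLinkLocalUnicast():
		return "link-local"
	default:
		return "other IP literal"
	}
}

func main() {
	for _, ace := range []string{
		"169.254.169.254", // cloud metadata endpoint
		"192.168.1.1",
		"127.0.0.1",
		"0.1.0.0",
		"example.com",
	} {
		fmt.Printf("%-16s %s\n", ace, classify(ace))
	}
}
```

Any result other than "not an IP literal" means a value that entered as a "hostname" is about to be dialled as an address.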

Where the spec ends and the caller begins

UTS-46 is a Unicode Technical Standard governing hostname canonicalisation. It is
not an SSRF defence library. The golang.org/x/net/idna package
implements that spec, and the spec does not mandate IP-literal rejection.
Read on those terms, the post-IDNA recheck belongs to the caller. That
is the position the golang.org/x/net/idna maintainer took in response
to the private report, and it is internally consistent: a library that
quietly added an IP-literal rejection step would be deviating from the
specification it claims to implement, and would surprise other
specification-conformant callers downstream.

So the question is not whether the spec is correct. It is what
caller-side guardrail looks like when the spec leaves the recheck to the
caller. The anti-pattern is widespread for the same reason most security
bugs are widespread: the shape "canonicalise, then use" reads as
obviously correct to anyone who has not already been bitten by it.
A reviewer scanning a hundred lines of HTTP plumbing has no reason to
flag an idna.Lookup.ToASCII call followed by a JoinHostPort. The fix
has to live somewhere the reviewer's eyes do not have to be: in static
analysis, in CI gates, in codemods that rewrite the call site so the
guard is impossible to forget.

The fix

Trim trailing dots and re-check with net.ParseIP or
netip.ParseAddr after the IDNA call. Reject if the result parses as an
IP literal:

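import (
    "errors"
    "net/netip"
    "strings"

    "golang.org/x/net/idna"
)

var errIDNAIPLiteralSmuggle = errors.New("idna: post-mapping IP literal smuggle")

// canonicaliseHost maps host with UTS-46 and rejects results that are IP literals.
func canonicaliseHost(host string) (string, error) {
    ace, err := idna.Lookup.ToASCII(host)
    if err != nil {
        return "", err
    }
    // Recheck after mapping: strip every trailing dot, then reject IP literals.
    if _, ipErr := netip.ParseAddr(strings.TrimRight(ace, ".")); ipErr == nil {
        return "", errIDNAIPLiteralSmuggle
    }
    return ace, nil
}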

Two properties of that guard are load-bearing.

First, strings.TrimRight(ace, ".") and not strings.TrimSuffix(ace, ".").
UTS-46 maps fullwidth dot U+FF0E and ideographic dot U+3002 to ASCII
dot. An input like 0.¹.0.0．． (two trailing fullwidth dots after the
last numeric label) maps to 0.1.0.0.. post-IDNA. TrimSuffix(_, ".")
strips one dot and leaves 0.1.0.0., which netip.ParseAddr rejects as
non-IP, silently passing the smuggle through. TrimRight removes any number
of trailing dots and closes the variant.
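The difference between the two trims is easy to check in isolation. This sketch feeds already-mapped strings (assumed to be what the IDNA call returned) to two hypothetical guards, one built on strings.TrimSuffix and one on strings.TrimRight.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// guardTrimSuffix reports whether the post-IDNA value is detected as an IP
// literal when only ONE trailing dot is removed.
func guardTrimSuffix(ace string) bool {
	_, err := netip.ParseAddr(strings.TrimSuffix(ace, "."))
	return err == nil
}

// guardTrimRight reports whether the post-IDNA value is detected as an IP
// literal when ALL trailing dots are removed.
func guardTrimRight(ace string) bool {
	_, err := netip.ParseAddr(strings.TrimRight(ace, "."))
	return err == nil
}

func main() {
	for _, ace := range []string{"0.1.0.0", "0.1.0.0.", "0.1.0.0..", "example.com."} {
		fmt.Printf("%-14q TrimSuffix detects=%-5v TrimRight detects=%v\n",
			ace, guardTrimSuffix(ace), guardTrimRight(ace))
	}
}
```

Only the TrimRight guard flags the double-dot variant; both leave a real hostname such as example.com. alone.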

Second, the recheck has to happen after the IDNA call. A pre-IDNA
net.ParseIP is worse than no guard at all: it gives the reviewer the
false impression that the input shape has been validated, which is exactly
why the pre-IDNA-only check is the most common form of the anti-pattern in
the wild. The smuggled literal is, by construction, not an IP before
mapping. The check has to be post-mapping, or it does not catch the bug.

Here is a thing worth pausing on. I went looking for production callers
already doing this canonical guard. Across 19 OSS Go repositories and 31
production callsites of idna.*ToASCII (a sweep I will get to in a
moment), zero used the TrimRight + netip.ParseAddr shape. One came
close: google/safebrowsing does a both-sided strings.Trim plus an
in-house parseIPAddress. Everyone else just returned the post-IDNA
string and trusted downstream.

That is the install base. That is what detection has to handle.

What I shipped, and why this shape

The repo is at
https://github.com/astrogilda/idna-ip-literal-smuggle-rules. The latest
verified release is v0.1.1. The earlier v0.1.0 mark predates the CodeQL
DB-backed verification I will describe below; treat v0.1.1 as the first
release where I am confident the CodeQL query actually fires on the
canonical wrapper shape in a real DB extraction, not just on synthetic
fixtures.

The full strategy synthesis lives in the repo at
https://github.com/astrogilda/idna-ip-literal-smuggle-rules/blob/main/docs/research/v0.1-detection-strategy.md.
The short version is two layers of detection plus one deliberate omission.

Layer 1: CodeQL is the primary recall vehicle

IdnaIpLiteralSmuggle.ql uses TaintTracking::GlobalWithState with two
flow states (TPreIdna, TPostIdna). A single barrier predicate,
safePostIdnaRecheck(postIdnaSource, node), ties the trim source to the
post-IDNA tainted predecessor, so a recheck on an unrelated value cannot
silence the alert. The state-transition step covers
(*idna.Profile).ToASCII and (*idna.Profile).ToUnicode; the
package-level idna.ToASCII is excluded because it is a plain Punycode
wrapper with no NFKC mapping and therefore no smuggle surface. Sinks span
11 families, including JoinHostPort, the Dial family, (*url.URL).Host,
(*tls.Config).ServerName, (*http.Cookie).Domain, HTTP client request URLs,
and the package-level and (*net.Resolver) DNS primitives.

CodeQL's inter-procedural taint walks through a one-deep wrapper like
idnaASCII for free, no isAdditionalFlowStep modelling required. The
URL.Hostname taint enters the wrapper, propagates through its body, exits
as the return value, and flows into the caller's net.JoinHostPort sink.
This is how the bug actually appears in production code, so this is what
detection has to model.

I extracted a CodeQL DB for golang.org/x/net/http/httpproxy and ran the
query. At v0.1.1 it fires twice on the reproducer for the canonicalAddr
shape, registers 23 unique sink alerts on the positive-fixture suite,
and emits zero genuine false positives on the negative fixtures. That is
the first version where I trust the rule.

Layer 2: Semgrep OSS as the direct-call precision sweep

idna-ip-literal-smuggle.yaml is a mode: taint rule, intra-procedural,
runs against community Semgrep with no Pro features required. There is a
sibling -pro.yaml that adds interfile: true for operators with the
Pro Engine, and an opt-in -experimental.yaml that widens the source set
to hostname-typed field reads. Default OSS first; Pro and experimental
are opt-in.

Now the part that surprised me. I ran the OSS, Pro, and experimental
yamls against three corpora: golang/go, kubernetes/kubernetes, and
prometheus/prometheus, totalling 660 MB of Go. The result table is
short:

Rule	golang/go	kubernetes	prometheus
`idna-ip-literal-smuggle` (OSS)	0	0	0
`idna-ip-literal-smuggle-pro`	0	0	0
`idna-ip-literal-smuggle-experimental`	0	0	0

Zero. Across all three rules and all three corpora.

My first instinct was the same as yours probably is now: the rule
under-fires. But there are exactly two production callsites of any
UTS-46-mapping idna.*ToASCII profile in those 660 MB, and both are
wrapped:

golang/go: src/net/http/request.go:799
golang/go: src/vendor/golang.org/x/net/http/httpproxy/proxy.go:312

Other idna usages in the corpus call package-level idna.ToASCII,
which dispatches to the Punycode profile, which is correctly out of
scope. The OSS rule cannot step through idnaASCII because OSS Semgrep
is intra-procedural. The Pro yaml's interfile: true is a no-op without
the Pro Engine binary, so against community Semgrep it behaves the same
as OSS. The experimental yaml's relaxed source set still requires the
matching field on a struct passed directly to idna.*ToASCII in the
same function, which does not occur in these codebases.

Zero is the honest answer at the OSS tier for a corpus where the only
in-scope callsites are wrapped. CodeQL with TaintTracking::GlobalWithState
catches them, no Pro licence required. Semgrep OSS catches the direct-call
shape, and the direct-call shape is what fits in an intra-procedural
analyzer's mouth. The right move is not to make Semgrep OSS fire louder
by relaxing the source set; the right move is to lead the registry
submission with CodeQL and frame Semgrep OSS as the precision sweep, not
the recall workhorse.

Layer 3: I deliberately did not ship a blanket structural rule

This is the call I want to defend explicitly because the obvious thing
to ship is exactly the wrong thing.

I catalogued 31 production callsites of idna.*ToASCII across 19
distinct repos: caddy, vault, certmagic, ooni/probe-cli, smallstep,
sing-box, mattermost, hostmatcher, lorawan-stack, tlsproxy, datadog-agent,
cloudflared, mosn, safebrowsing, sniproxy, q, whatwg-url, plus the Go
stdlib and the x/net mirrors. The classification:

Class	Count	Description
(a) direct call	18	`idna.Lookup.ToASCII(input)` directly in a function whose source is identifiable
(b) one-deep wrapper	6	small helper named `idnaASCII` / `toASCII` returning the call result raw
(c) multi-deep / conditional wrapper	6	helper that branches on `isASCII` / `len` and only calls ToASCII on a sub-path
(d) post-call IP recheck present	1	`google/safebrowsing/urls.go:260`, both-sided `strings.Trim` plus in-house parseIPAddress
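
For readers who have not seen these shapes side by side, here is a hypothetical fixture for classes (a) to (c). Every name in it is made up for illustration, and toASCII is a do-nothing stand-in for idna.Lookup.ToASCII so that the sketch builds with the standard library only; none of it is taken from the repositories listed above.

```go
package main

import (
	"fmt"
	"net"
)

// toASCII stands in for idna.Lookup.ToASCII (an assumption for this sketch).
// It performs no mapping at all; only the call shape matters here.
func toASCII(host string) (string, error) { return host, nil }

// Class (a): direct call. Source, IDNA call, and sink share one function,
// which is the shape an intra-procedural rule can follow.
func dialTargetDirect(host, port string) (string, error) {
	ace, err := toASCII(host)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(ace, port), nil
}

// Class (b): one-deep wrapper returning the call result raw.
func idnaASCII(host string) (string, error) { return toASCII(host) }

// The sink is in the caller, so the taint has to cross a function boundary.
func dialTargetWrapped(host, port string) (string, error) {
	ace, err := idnaASCII(host)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(ace, port), nil
}

// isASCII reports whether s contains only single-byte (ASCII) characters.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

// Class (c): conditional wrapper that only calls ToASCII on a sub-path.
func idnaASCIIIfNeeded(host string) (string, error) {
	if isASCII(host) {
		return host, nil
	}
	return toASCII(host)
}

func main() {
	direct, _ := dialTargetDirect("example.com", "443")
	wrapped, _ := dialTargetWrapped("example.com", "443")
	cond, _ := idnaASCIIIfNeeded("example.com")
	fmt.Println(direct, wrapped, cond)
}
```

None of the three shapes contains a post-call recheck, which is what separates them from class (d).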

A blanket structural rule along the lines of "any ToASCII without
TrimRight + netip.ParseAddr in the same block" would fire on 30 of 31
callsites. The 30 are not all bugs. Many are Public Suffix List (PSL)
walkers, registrar pipelines, and TLS-cert-manager code where the ToASCII
result never network-routes on attacker input, so the missing recheck is
not a smuggle vector. weppos/publicsuffix-go, x/net/publicsuffix,
cloudflare-go issue 688, and every PSL-driven cookiejar codepath are
documented non-network-routing IDNA users. A rule that flags them as
smuggle bugs pathologises the documented library contract. It would be
roughly 95% false-positive in the strict "vulnerable to UTS-46 smuggling"
sense, and operators would lose trust within one CI cycle.

This is also, I suspect, why no prior IDNA-class CVE (CVE-2021-29923,
CVE-2024-12224, CVE-2024-3651, CVE-2026-39821) shipped with a Semgrep or
CodeQL rule attached. The detection space is too noisy without
inter-procedural taint scoping. Taint scoping is the work, not an
implementation detail of it.

The two-tool stratification (CodeQL inter-procedural recall plus Semgrep
OSS direct-call precision) is the same shape BadRedirectCheck.ql and
IncompleteUrlSchemeCheck.ql use in the CodeQL community pack today. It
is a defensible registry-submission narrative because it is a documented
prior pattern, not an invention.

Pick one, or pick two

If you have CodeQL in CI already, run the query in nightly MRVA
(multi-repository variant analysis) sweeps. It will catch wrapped
callsites in addition to direct ones. If you have
Semgrep in CI already, run the OSS rule on every PR. It is fast, it has
no measured FPs on the three corpora I tested, and it will catch the
direct-call shapes that survive into your code. If you have both, run
both. They answer slightly different questions and the failure modes do
not overlap.

If you maintain Go code that calls idna.Lookup.ToASCII,
idna.Display.ToASCII, or any custom profile constructed via
idna.New(idna.MapForLookup(), ...), add the post-IDNA IP-literal
recheck. If you have downstream allowlists, SSRF guards, NO_PROXY
lists, TLS-SNI routing, or cookie-domain validation that depends on
hostname canonicalisation, you have the bug class somewhere in your
dependency graph. The rules at
https://github.com/astrogilda/idna-ip-literal-smuggle-rules will tell
you where.

Limitations

IPv4 only. IPv6 colons are rejected by IDNA rune-validation before
NFKC runs, so there is no IPv6 path through this mechanism (the
IPv4-mapped-IPv6 macro-encoding class is a separate bug, separate
sanitizer, separate post). Go-specific tooling; the same anti-pattern
exists in Python's kjd/idna, Node's url.domainToASCII, and ICU's
uidna_*, but each ecosystem needs its own rule. WHATWG-integrated
URL parsers (callers that use url.Parse and never touch
idna.*.ToASCII directly) are out of scope: the parser already runs
the ends_in_a_number host-shape check post-decode.

In flight: registry submissions

CodeQL community-pack PR: https://github.com/github/codeql/pull/21784
Semgrep registry PR: https://github.com/semgrep/semgrep-rules/pull/3841

Both PRs reference the upstream strategy doc so reviewers see the
design rationale before asking. If you want to follow along, those
are the threads to watch.

Corrections and additional fold-class fixtures are welcome on the
repository.

DEV Community