DEV Community

szp2005

Reconciling 8 IP-reputation feeds into one verdict: averaging is the wrong default

Wire more than one IP-reputation source into a risk check and sooner or later they disagree. One feed says the IP is a residential ISP address. Another calls it a datacenter VPN. A blocklist says it relayed spam last week. A geolocation provider says it's clean and unremarkable.

The naive move is to normalize everything to 0–100 and average it. I did that first. It produces a number that's wrong in specific, reproducible ways, and on top of that a number nobody can act on. The moment a verdict matters, someone asks "why is this 62?" and the average has no answer.

After the averaging version kept embarrassing me, I landed on one that reads as a decision log. Every rule below is there because some real IP broke the version before it.

Why averaging fails: three concrete failure modes

1. Low-precision sources dominate the consensus. Some feeds label entire datacenter /16 blocks as "proxy" or "VPN" wholesale. They're cheap and high-recall, so they're noisy. Average them in and a plain Hetzner or Linode box that two of these feeds tagged as "proxy" gets dragged up into mid-risk territory, even when every higher-precision source says it's just hosting. You've shipped a scorer that cries wolf on half of AWS.

2. A single low-confidence report flips a binary feed. Abuse-report databases are community-fed. If your rule is flagged = (totalReports > 0), one retaliatory or mistaken report marks an address as a known abuser. I watched 8.8.8.8, Google Public DNS, come back as "abuser" because somebody somewhere reported it once. Averaging doesn't save you. It buries the bad signal under the good ones for most IPs and then surfaces it on the unlucky ones.

3. Averaging dilutes the one source that matters most. A live spam-relay listing, or membership in a Tor exit-node list, sits close to ground truth. Seven geolocation feeds saying "nothing unusual" should not be allowed to wash that out. Risk signals aren't symmetric, and an average pretends they are.
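
To put numbers on the first and third failure modes, here is a small worked example. The per-source scores are made up for illustration (they are not taken from any real feed), and `naive_average` is a hypothetical helper standing in for the unweighted mean described above.

```python
def naive_average(scores):
    """Plain mean of per-source scores already normalized to 0-100."""
    return sum(scores) / len(scores)


# Failure mode 3: one near-ground-truth Tor listing (assumed score 95)
# against seven geolocation feeds that see nothing unusual (assumed score 5).
tor_case = [95] + [5] * 7
print(naive_average(tor_case))  # 16.25

# Failure mode 1: two low-precision feeds shout "proxy" (assumed score 90)
# while three higher-precision feeds say "just hosting" (assumed score 10).
hosting_case = [90, 90, 10, 10, 10]
print(naive_average(hosting_case))  # 42.0
```

The Tor exit lands at 16.25, which reads as low risk, while the plain hosting box lands at 42, which reads as mid risk. The two results are ranked the wrong way round.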

The model: visible per-source verdicts, asymmetric floors

Two ideas did most of the work.

The first: don't collapse to one opaque number. Keep every source's verdict and show it as its own line item. Which feed, what it claimed, what signal category it falls under (datacenter, residential proxy, Tor exit, active abuser, spam-list hit). Then whoever consumes the score decides whether a given flag matters for their case. A Tor-exit listing is disqualifying for a signup flow and irrelevant for a geo-IP cache.
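
A minimal sketch of what "one line item per source" could look like as data. The type names, field names, and feed names are illustrative assumptions, not the scorer's actual schema; only the five signal categories come from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    DATACENTER = "datacenter"
    RESIDENTIAL_PROXY = "residential_proxy"
    TOR_EXIT = "tor_exit"
    ACTIVE_ABUSER = "active_abuser"
    SPAM_LIST_HIT = "spam_list_hit"


@dataclass(frozen=True)
class SourceVerdict:
    source: str     # which feed (placeholder names below)
    claim: str      # what the feed said
    signal: Signal  # which category the claim falls under


def relevant(verdicts, cares_about):
    """Keep only the line items whose category matters to this consumer."""
    return [v for v in verdicts if v.signal in cares_about]


verdicts = [
    SourceVerdict("feed_a", "listed as Tor exit", Signal.TOR_EXIT),
    SourceVerdict("feed_b", "hosting provider range", Signal.DATACENTER),
]

# A signup flow cares about Tor exits; a geo-IP cache (in this sketch) cares about none.
signup_view = relevant(verdicts, {Signal.TOR_EXIT, Signal.ACTIVE_ABUSER})
geo_cache_view = relevant(verdicts, set())
```

The same list of verdicts serves both consumers. Each one applies its own filter instead of inheriting a decision baked into a single number.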

The second: keep a weighted baseline, but let signal type set a hard floor. The aggregate starts as a precision-weighted average, and then certain confirmed signals impose a minimum the average can't pull below.

Tor exit node (confirmed)       → floor 90
Dedicated proxy/VPN (consensus) → floor 65
Confirmed abuser                → floor 55
Datacenter / hosting            → floor 35

A floor says: if this signal is present, the score can't drop below X no matter how many geo feeds call the address clean. Swapping type-driven floors in for the pure average is the one change that got the output to line up with what an analyst would actually conclude.
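
The floor logic can be sketched in a few lines. The floor values are the ones listed above; the function names, the signal keys, and the precision weights in the example are assumptions made for illustration.

```python
FLOORS = {
    "tor_exit": 90,
    "proxy_vpn": 65,
    "abuser": 55,
    "datacenter": 35,
}


def weighted_average(scores):
    """scores: list of (score_0_to_100, precision_weight) pairs."""
    total_weight = sum(weight for _, weight in scores)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scores) / total_weight


def final_score(scores, confirmed_signals):
    """Return (final, baseline, fired_floors); the final value never drops below a fired floor."""
    baseline = weighted_average(scores)
    fired = {name: FLOORS[name] for name in confirmed_signals if name in FLOORS}
    floor = max(fired.values(), default=0)
    return max(baseline, floor), baseline, fired


# Seven geo feeds at 5 (weight 1.0) and one Tor list at 95 (weight 2.0): all assumed values.
scores = [(5, 1.0)] * 7 + [(95, 2.0)]
final, baseline, fired = final_score(scores, {"tor_exit"})
print(baseline, final)  # 25.0 90
```

Even with the Tor list weighted double, the baseline only reaches 25. The floor is what carries the verdict to 90. With no confirmed signal the function returns the baseline unchanged.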

The rules that keep the floors honest

A floor is only as trustworthy as the boolean that trips it. Each of these earned its place by killing a specific false positive.

  • Proxy/VPN needs consensus from dedicated sources. The low-precision general feed never gets to establish a proxy verdict on its own. On datacenter ranges I require ≥2 dedicated (purpose-built proxy/VPN) sources to agree. On residential ranges ≥1 is enough, since a residential proxy is rarer and so means more when a specialized feed flags it. Hetzner and Linode fall back to "hosting 35" instead of a phantom "proxy 65," and a real consumer-ISP proxy still trips.

  • Tighten the noisy binary feed. An abuse listing now requires a score ≥ 25 and either ≥ 3 reports or ≥ 2 distinct reporters, and the address can't be on the provider's own allowlist. With that in place, 8.8.8.8 stops being an abuser.

  • Allowlist known infrastructure ASNs. Membership in an ASN belonging to Google, Cloudflare, and the like suppresses the abuser and hosting floors. A CDN edge node isn't a threat, and you don't want your scorer picking fights with the backbone of the internet.

  • Treat ASN reputation as standalone evidence. A small set of autonomous systems are VPN/proxy-only businesses: M247, Mullvad, Proton, a handful of others. For these, membership alone settles it, with no cross-source consensus needed, because the network operator's identity is the signal. This recovers the case where one feed alone recognizes a niche VPN that the consensus rule above would otherwise suppress.

  • Add a hard, independent signal: DNSBL over DoH. I query a handful of DNS blocklists, reversing the octets against each zone and going over DNS-over-HTTPS so it runs from an edge runtime. A hit there is close to ground truth and leans on nobody's opaque vendor score.

  • Short-circuit reserved and CGNAT ranges before scoring. CGNAT (100.64.0.0/10), TEST-NET, benchmark, multicast, and the IPv6 equivalents get an explicit "reserved, here's the category" response rather than going through the pipeline to be mislabeled. It also keeps thousands of carrier-NAT users behind one exit from being scored as a shared proxy.
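
Several of the rules above reduce to small pure functions. The sketch below covers four of them: the proxy consensus threshold, the tightened abuse rule, the reserved-range short-circuit, and the reversed-octet DNSBL query name. All function names are hypothetical, `dnsbl.example` is a placeholder zone, the range table is a partial list, and the abuse rule assumes "or ≥ 2 distinct reporters" is an alternative to the report count. The DoH request itself is left out.

```python
import ipaddress

# Partial, illustrative table of ranges to answer before any scoring happens.
SHORT_CIRCUIT = {
    "cgnat": ipaddress.ip_network("100.64.0.0/10"),
    "test-net-1": ipaddress.ip_network("192.0.2.0/24"),
    "benchmark": ipaddress.ip_network("198.18.0.0/15"),
    "multicast": ipaddress.ip_network("224.0.0.0/4"),
    "ipv6-documentation": ipaddress.ip_network("2001:db8::/32"),
}


def reserved_category(ip):
    """Return the reserved category for ip, or None if it should go on to scoring."""
    addr = ipaddress.ip_address(ip)
    for name, net in SHORT_CIRCUIT.items():
        if addr.version == net.version and addr in net:
            return name
    return None


def proxy_confirmed(dedicated_hits, is_datacenter_range):
    """Datacenter ranges need two dedicated proxy/VPN sources to agree; residential needs one."""
    needed = 2 if is_datacenter_range else 1
    return dedicated_hits >= needed


def abuse_listed(score, reports, distinct_reporters, on_provider_allowlist):
    """Tightened abuse rule; a provider allowlist entry always wins."""
    if on_provider_allowlist:
        return False
    return score >= 25 and (reports >= 3 or distinct_reporters >= 2)


def dnsbl_query_name(ipv4, zone):
    """Build the DNSBL lookup name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(reserved_category("100.64.1.1"))                 # cgnat
print(proxy_confirmed(1, is_datacenter_range=True))    # False
print(abuse_listed(30, 1, 1, False))                   # False
print(dnsbl_query_name("127.0.0.2", "dnsbl.example"))  # 2.0.0.127.dnsbl.example
```

Keeping each gate as its own boolean function makes it easy to record which one fired, and to unit-test each gate against the IP that motivated it.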

Make the verdict auditable, not just displayable

If I had to press one point on anyone building this, it's this: emit the breakdown as structured data, not just the final number. Every lookup returns each source's contribution, the weighted average before floors, which floors fired and why, and the final value. You get debuggability out of it. When a verdict looks wrong, the breakdown tells you at a glance whether it was a bad weight, a floor that shouldn't have fired, or thin data. You also let the user overrule you: the person reading the score can tell whether it rests on one thin signal or a five-way consensus, and judge for their own case. A black-box number forces an all-or-nothing choice: trust it blind or throw it out.
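
One possible shape for that breakdown, sketched with dataclasses and serialized to JSON. The field names, the source names, and the example values are assumptions, not the actual response format; the address is from a documentation range and is used purely as a placeholder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Contribution:
    source: str
    score: float
    weight: float


@dataclass
class FiredFloor:
    signal: str
    floor: int
    reason: str


@dataclass
class Breakdown:
    ip: str
    contributions: list     # list of Contribution
    weighted_average: float  # value before any floor is applied
    floors_fired: list       # list of FiredFloor
    final: float

    def to_json(self):
        # asdict() recurses into the nested dataclasses inside the lists.
        return json.dumps(asdict(self), indent=2)


report = Breakdown(
    ip="203.0.113.7",
    contributions=[
        Contribution("geo_feed_a", 5, 1.0),
        Contribution("tor_list", 95, 2.0),
    ],
    weighted_average=65.0,  # (5*1.0 + 95*2.0) / 3.0
    floors_fired=[FiredFloor("tor_exit", 90, "listed by tor_list")],
    final=90,
)
print(report.to_json())
```

With the pre-floor average and the fired floors both in the payload, a reader can see that 90 came from a floor rather than from consensus, and can discount it if Tor exits don't matter for their case.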

What I still haven't solved well

A few open problems, since anyone who's done this for real will have opinions.

CGNAT and mobile carriers are the worst of them. Shared-exit NAT and a residential proxy pool throw off the same surface signal: many users, one IP. Short-circuiting the reserved CGNAT block helps, but carriers also use public ranges that look identical to a proxy from the outside. I flag uncertainty rather than guess, and I still don't have a clean discriminator.

Then there's absence of evidence versus evidence of absence. For smaller regional ISPs the databases run thin. "No source flagged it" reads as "clean" when it often just means "nobody has data." Right now I surface coverage, the count of how many sources had any opinion at all, next to the verdict. I'm not convinced that's enough.
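
A rough sketch of the coverage idea, assuming each source reports one of three states: flagged, looked and found nothing, or no data. The function name, the status labels, and the three-state encoding are illustrative choices, not the scorer's actual output.

```python
def summarize_coverage(opinions):
    """opinions maps source name -> True (flagged), False (nothing found), None (no data)."""
    with_data = [v for v in opinions.values() if v is not None]
    flagged = sum(1 for v in with_data if v)
    if not with_data:
        status = "no_data"    # nobody knows anything: not the same as clean
    elif flagged:
        status = "flagged"
    else:
        status = "no_flags"   # at least one source looked and found nothing
    return {
        "status": status,
        "coverage": len(with_data),
        "sources": len(opinions),
        "flagged": flagged,
    }


thin = summarize_coverage({"feed_a": None, "feed_b": None, "feed_c": False})
empty = summarize_coverage({"feed_a": None, "feed_b": None, "feed_c": None})
print(thin)   # status "no_flags", coverage 1 of 3
print(empty)  # status "no_data", coverage 0 of 3
```

Separating "no_data" from "no_flags" at least stops a silent database from being reported as a clean bill of health. It does not say how much a 1-of-3 result should be trusted.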

Last, the residential-versus-datacenter split. When two classifiers disagree on the same IP I show both labels and leave it unresolved. Whether a confidence-weighted merge beats preserving the raw disagreement, I genuinely don't know.

If you've run reputation scoring at scale, I'd value your take on the /24 neighbor signal (contamination ratio weighted by flag recency?) and on the residential/datacenter conflict above.


The scorer described here runs behind ipok.io, a free, no-login IP reputation checker that shows the per-source breakdown instead of a single number. The CLI is MIT on GitHub. Happy to go deeper on any of the data-source quirks in the comments.
