Filtering bot and spam traffic out of your analytics

#javascript #webdev #security #analytics

If your analytics counts bots, every number you make decisions on is skewed: traffic looks better than it is, and conversion looks worse. Here is the layered filter we run at ingestion, cheapest checks first.

[Figure: the same traffic shown twice, once with bots counted (inflated) and once with bots filtered out (honest).]

I build this for Zenovay (web analytics). None of these checks is perfect alone, which is why they are layered.

Layer 1: it never ran JavaScript

The single most effective filter. Most crawlers fetch HTML and leave. If your analytics fires from a script, a large class of bots is already excluded because they never execute it. This is free and catches a lot, but not headless browsers, which do run JS.
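
A minimal sketch of what "fires from a script" can look like. The `/collect` endpoint, the payload shape, and the `trackPageview` name are assumptions for illustration, not Zenovay's actual tracker. A client that never executes this script never sends the event, so it never reaches ingestion at all.

```javascript
// Hypothetical collection endpoint; replace with your own.
const COLLECT_URL = "/collect";

// Sends one pageview event. Only clients that execute JavaScript ever get here.
// `nav` and `loc` are injectable so the function can be exercised outside a browser.
function trackPageview(nav = globalThis.navigator, loc = globalThis.location) {
  const payload = JSON.stringify({
    type: "pageview",
    path: loc?.pathname ?? "/",
    ts: Date.now(),
  });
  // sendBeacon queues the request without blocking navigation.
  if (nav && typeof nav.sendBeacon === "function") {
    return nav.sendBeacon(COLLECT_URL, payload);
  }
  return false; // no beacon support: nothing is sent in this sketch
}
```

In a page you would call `trackPageview()` once after load. A crawler that only fetches the HTML never runs it.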

Layer 2: obvious user agent signatures

const BOT_UA = /(bot|crawl|spider|slurp|headless|phantom|puppeteer|playwright|curl|wget|python-requests|axios)/i;

function looksLikeBotUA(ua = "") {
  if (!ua) return true;             // no UA at all is suspicious
  return BOT_UA.test(ua);
}

A regex on the UA is trivial to spoof, so treat it as a hint, not proof. It mostly clears out honest bots that identify themselves.
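
To see why this is only a hint, here is a small sketch that runs the same regex over a few sample strings. The user agent strings are illustrative and not taken from real logs. The last one is a phone whose model name happens to contain "bot".

```javascript
const BOT_UA = /(bot|crawl|spider|slurp|headless|phantom|puppeteer|playwright|curl|wget|python-requests|axios)/i;

function looksLikeBotUA(ua = "") {
  if (!ua) return true;
  return BOT_UA.test(ua);
}

// Illustrative user agent strings (assumptions, not real log data).
const samples = {
  googlebot: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  curl: "curl/8.4.0",
  headlessChrome: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36",
  desktopChrome: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  // Model name contains "BOT", so the case-insensitive regex matches a real phone.
  cubotPhone: "Mozilla/5.0 (Linux; Android 9; CUBOT_X19) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
};

const results = Object.fromEntries(
  Object.entries(samples).map(([name, ua]) => [name, looksLikeBotUA(ua)])
);
console.log(results);
```

The first three match, the desktop Chrome string does not, and the phone is a false positive. A bot that simply sends the desktop Chrome string would pass, which is the spoofing case.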

Layer 3: datacenter and known ranges

Most real users come from residential and mobile networks, so a burst from a cloud provider ASN is usually automation. At the edge you often get the ASN for free.

const DATACENTER_ASNS = new Set([
  // a maintained list of cloud/hosting ASNs
  16509, // aws
  15169,  // google (main network, also some cloud ranges)
  396982, // google cloud
  8075,  // azure
  14061, // digitalocean
]);

function fromDatacenter(request) {
  const asn = request.cf?.asn;        // cloudflare provides this
  return asn ? DATACENTER_ASNS.has(asn) : false;
}

Careful: some legitimate corporate traffic and VPNs also exit through datacenter ranges, so do not hard drop on this alone. We flag, then combine with behavior.

Layer 4: behavior that is not human

The strongest signal after "did it run JS". Humans are slow and irregular. Bots are fast and uniform.

function behaviorIsBotty(session) {
  // many pageviews in an impossibly short time
  if (session.pageviews >= 10 && session.durationMs < 2000) return true;
  // zero mouse, scroll, or key events across a long visit
  if (session.durationMs > 5000 && session.interactions === 0) return true;
  // perfectly regular timing between events (scripted)
  if (session.events.length > 5 && variance(session.interEventMs) < 1) return true;
  return false;
}
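
The `variance` helper is not shown in the post. A minimal sketch, assuming population variance over the gaps between events and assuming `interEventMs` is derived from event timestamps (the `interEventGaps` helper and the sample timestamps are hypothetical):

```javascript
// Population variance of a list of numbers (ms^2 when the inputs are millisecond gaps).
// Returns Infinity for fewer than two values so tiny samples never look "perfectly regular".
function variance(values = []) {
  if (values.length < 2) return Infinity;
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}

// Gaps between consecutive event timestamps, in milliseconds.
function interEventGaps(timestamps = []) {
  const gaps = [];
  for (let i = 1; i < timestamps.length; i++) {
    gaps.push(timestamps[i] - timestamps[i - 1]);
  }
  return gaps;
}

// A scripted client firing every 500 ms versus an irregular human-like pattern.
const scripted = interEventGaps([0, 500, 1000, 1500, 2000, 2500, 3000]);
const human = interEventGaps([0, 320, 1900, 2100, 5400, 5650, 9000]);

console.log(variance(scripted)); // 0: every gap is identical
console.log(variance(human) > 1); // true: gaps vary widely
```

A threshold of 1 on millisecond gaps is strict, so only near-identical timing trips it. Timer jitter in a real scripted client may push it above that, which is one more reason to combine signals.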

Putting it together

function classify(request, ua, session) {
  let score = 0;
  if (looksLikeBotUA(ua)) score += 2;
  if (fromDatacenter(request)) score += 1;
  if (behaviorIsBotty(session)) score += 3;
  // 0 to 1 = human, 2 = suspicious (flag), 3+ = bot (exclude from metrics)
  return score >= 3 ? "bot" : score >= 2 ? "suspicious" : "human";
}
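
It helps to see what the weights actually imply. This sketch enumerates every combination of the three signals using the same weights and thresholds as `classify`. The `WEIGHTS`, `labelFor`, and `enumerateOutcomes` names are introduced here for illustration only.

```javascript
// Weights and thresholds copied from classify() above.
const WEIGHTS = { ua: 2, datacenter: 1, behavior: 3 };

function labelFor(score) {
  return score >= 3 ? "bot" : score >= 2 ? "suspicious" : "human";
}

// Every combination of the three boolean signals with its score and label.
function enumerateOutcomes() {
  const rows = [];
  for (const ua of [false, true]) {
    for (const datacenter of [false, true]) {
      for (const behavior of [false, true]) {
        const score =
          (ua ? WEIGHTS.ua : 0) +
          (datacenter ? WEIGHTS.datacenter : 0) +
          (behavior ? WEIGHTS.behavior : 0);
        rows.push({ ua, datacenter, behavior, score, label: labelFor(score) });
      }
    }
  }
  return rows;
}

console.table(enumerateOutcomes());
```

Datacenter alone scores 1 and stays "human", a UA match alone scores 2 and is "suspicious", and the behavior check alone is enough for "bot". A UA match plus a datacenter ASN also reaches 3, so that pair is excluded without any behavioral evidence. That matters because a missing UA counts as a UA match.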

The decision that matters: drop or flag

Do not delete suspicious traffic. We keep everything but tag it, and exclude bots from the default metrics. That way if our filter is wrong, the data is recoverable, and you can audit how much you are filtering. A silent filter you cannot inspect is its own bug.
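
A rough sketch of what "tag, don't delete" can look like at query time. The event shape, the sample rows, and the helper names are hypothetical. The point is that the label lives on the row and the exclusion happens when you read, not when you write.

```javascript
// Hypothetical stored events: nothing is deleted, each row carries its label.
const events = [
  { path: "/", label: "human" },
  { path: "/pricing", label: "human" },
  { path: "/", label: "suspicious" },
  { path: "/", label: "bot" },
  { path: "/login", label: "bot" },
];

// Default metrics exclude bots; pass includeBots to see the raw count.
function countPageviews(rows, { includeBots = false } = {}) {
  return rows.filter((e) => includeBots || e.label !== "bot").length;
}

// Share of traffic per label, for auditing how much the filter removes.
function labelShares(rows) {
  const shares = {};
  for (const { label } of rows) shares[label] = (shares[label] ?? 0) + 1;
  for (const label of Object.keys(shares)) shares[label] /= rows.length;
  return shares;
}

console.log(countPageviews(events)); // 3
console.log(countPageviews(events, { includeBots: true })); // 5
console.log(labelShares(events)); // { human: 0.4, suspicious: 0.2, bot: 0.4 }
```

If the filter turns out to be wrong, relabeling the rows fixes the metrics retroactively, and the shares give you a number to watch for sudden jumps in filtered traffic.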

Disclosure: I build Zenovay, which does this filtering by default so your numbers are not inflated. The layered approach above is what runs under the hood.

What is your highest signal bot tell? For us it is a long visit with literally zero interaction events.