DEV Community

Sameer Sheikh for WhoisFreaks

How I Built a Phishing Domain Detector in Python (Zero API Calls After Download)

Most phishing detection tools are reactive. They wait for a domain to be reported, then add it to a blocklist. By that point, the campaign has already been running for hours or days.

I wanted to flip that. Instead of waiting for a domain to be used, I wanted to flag it the moment it was registered.

Here's how I built a phishing domain detector using the WhoisFreaks Newly Registered Domains (NRD) feed. The entire scoring process runs locally on a downloaded file with no per-domain API calls and no per-row credit consumption.


The Key Insight: The Data Is Already There

Most WHOIS-based tools work like this:

Download list of domains -> loop through -> call WHOIS API for each one

That's slow and expensive. WhoisFreaks' NRD feed works differently. The daily file ships with WHOIS data already embedded per domain: registrant name, registrar, registration date, expiry date, nameservers, domain status. All 65 columns, one row per domain, one file download.

So the workflow becomes:

Download gzipped CSV -> parse locally -> score everything -> done

No loop of API calls. No per-domain credit burn. You get the full picture in a single download.


What the NRD File Looks Like

The WhoisFreaks NRD "With WHOIS" endpoint returns a gzipped CSV. Each row covers one newly registered domain and includes columns like:

Column                  Example value
domain_name             paypa1-secure-verify.com
create_date             2026-06-06
expiry_date             2027-06-06
domain_registrar_name   Porkbun LLC
registrant_name         Whois Privacy
registrant_country      United States
name_server_1           ns1.porkbun.com
domain_status_1         clienttransferprohibited

WhoisFreaks publishes two daily files: a gTLD file (.com, .net, .org, .io etc.) and a ccTLD file covering country-code TLDs like .uk, .de, and .pk. Each row carries the same 65 columns. Merged, the two files typically run 300,000 to 400,000 rows per day. The full detector downloads and merges both.

Endpoints:

# gTLDs (.com .net .org .io etc.)
curl "https://files.whoisfreaks.com/v3.1/download/domainer/gtld?apiKey=YOUR_KEY&whois=true&date=2026-06-07"

# ccTLDs (.uk .de .pk .ca etc.)
curl "https://files.whoisfreaks.com/v3.1/download/domainer/cctld?apiKey=YOUR_KEY&whois=true&date=2026-06-07"

Setup

git clone https://github.com/WhoisFreaks/wf-python-detect-phishing-domains.git
cd wf-python-detect-phishing-domains
pip install -r requirements.txt
cp config.example.py config.py

Then open config.py and add your API key and brand names. Get a free key at whoisfreaks.com.


Step 1: Download gTLD + ccTLD Files and Merge

import csv, gzip, io, requests

NRD_BASE = "https://files.whoisfreaks.com/v3.1/download/domainer"

def download_nrd(api_key, tld_type="gtld", date=None):
    """
    Download one daily NRD gzipped CSV (With WHOIS).
    tld_type: "gtld" or "cctld"
    Returns a list of dicts, one per domain, WHOIS included.
    """
    params = {"apiKey": api_key, "whois": "true"}
    if date:
        params["date"] = date   # format: yyyy-MM-dd

    resp = requests.get(f"{NRD_BASE}/{tld_type}", params=params, timeout=300)
    resp.raise_for_status()

    content = gzip.decompress(resp.content).decode("utf-8", errors="replace")
    return list(csv.DictReader(io.StringIO(content)))


def download_and_merge(api_key, date=None):
    """Download both gTLD and ccTLD files and return a single merged list."""
    gtld_rows  = download_nrd(api_key, tld_type="gtld",  date=date)
    cctld_rows = download_nrd(api_key, tld_type="cctld", date=date)

    seen, merged = set(), []
    for row in gtld_rows + cctld_rows:
        domain = row.get("domain_name", "").strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            merged.append(row)
    return merged

Two HTTP requests. Two CSV files merged into one deduplicated dataset. All WHOIS data is already inside, so no further API calls are needed.
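One caveat: download_nrd holds the compressed bytes, the decoded text, and the full list of row dicts in memory at the same time. If that is too heavy for a 300,000+ row file, the rows can be streamed instead. The following is a minimal sketch, not part of the repository: iter_nrd_rows is a hypothetical helper that accepts any binary file-like object, and the sample data is made up for illustration.

```python
import csv
import gzip
import io


def iter_nrd_rows(fileobj):
    """Yield one dict per CSV row from a binary file-like object
    that contains gzipped CSV data (hypothetical helper)."""
    gz = gzip.GzipFile(fileobj=fileobj)
    # Decode lazily so only a small buffer is held in memory at a time.
    text = io.TextIOWrapper(gz, encoding="utf-8", errors="replace", newline="")
    with text:
        yield from csv.DictReader(text)


# Made-up sample standing in for a downloaded NRD file.
sample = gzip.compress(
    b"domain_name,create_date\n"
    b"example.com,2026-06-06\n"
    b"example.org,2026-06-06\n"
)

for row in iter_nrd_rows(io.BytesIO(sample)):
    print(row["domain_name"], row["create_date"])
```

With requests, passing stream=True to requests.get and handing resp.raw to the helper should avoid buffering the whole file, assuming the endpoint returns the .gz file as a plain binary body.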


Step 2: Fuzzy Similarity Scoring (Local, No API)

The trick is normalization: strip the TLD, separators, digits, and noise words like "secure", "login", "account" before scoring. Otherwise paypa1-secure-verify.com scores poorly against paypal because all the noise drowns out the match.

import re
from thefuzz import fuzz

NOISE_WORDS = {
    "account", "secure", "login", "verify", "support", "update",
    "official", "online", "service", "portal", "help", "center",
    "my", "get", "new", "app", "access", "confirm",
}

def normalize(domain):
    # rsplit drops the last label, so any TLD is stripped without an allowlist.
    # A regex allowlist would miss unlisted TLDs (e.g. .de, .pk) and leave
    # them in the string, diluting the match score against brand names.
    # Caveat: multi-label suffixes keep their second-level part, so
    # "example.co.uk" normalizes to "exampleco".
    core  = domain.lower().rsplit(".", 1)[0]
    parts = re.split(r"[-_.]", core)
    parts = [re.sub(r"\d", "", p) for p in parts]   # strip digits within tokens
    return "".join(p for p in parts if p and p not in NOISE_WORDS)

BRANDS = ["paypal", "amazon", "netflix", "google", "microsoft"]

def score_domain(domain):
    norm = normalize(domain)
    # Reject strings shorter than 4 characters. Digit-heavy domains like
    # "y1288.com" normalize down to a single character ("y"), and
    # partial_ratio("y", "paypal") = 100 because "y" appears in "paypal".
    # A 4-character minimum cuts these false positives cleanly.
    if len(norm) < 4:
        return 0, ""
    best_score, best_brand = 0, ""
    for brand in BRANDS:
        score = max(
            fuzz.ratio(norm, brand),
            fuzz.partial_ratio(norm, brand),
            fuzz.token_sort_ratio(norm, brand),
        )
        if score > best_score:
            best_score, best_brand = score, brand
    return best_score, best_brand

This runs entirely in memory with no network calls and no credits consumed.
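One possible variation on the normalization step: instead of deleting digits, map the common lookalike digits to the letters they imitate before stripping whatever is left. The sketch below is an assumption, not what the repository does. normalize_lookalike and the LOOKALIKES table are hypothetical names, and the digit-to-letter mapping is only a small illustrative set.

```python
import re

NOISE_WORDS = {
    "account", "secure", "login", "verify", "support", "update",
    "official", "online", "service", "portal", "help", "center",
    "my", "get", "new", "app", "access", "confirm",
}

# Hypothetical mapping of digits that are commonly used in place of letters.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})


def normalize_lookalike(domain):
    core = domain.lower().rsplit(".", 1)[0]
    parts = re.split(r"[-_.]", core)
    # Translate lookalike digits first, then strip any digits that remain.
    parts = [re.sub(r"\d", "", p.translate(LOOKALIKES)) for p in parts]
    return "".join(p for p in parts if p and p not in NOISE_WORDS)


print(normalize_lookalike("paypa1-secure-verify.com"))  # paypal
print(normalize_lookalike("g00gle-l0gin.net"))          # google
print(normalize_lookalike("y1288.com"))                 # yl
```

With plain digit stripping, "g00gle" collapses to "ggle", which no longer contains the brand. Mapping first keeps the intended spelling, and it also lets a token like "l0gin" be recognized as a noise word. Short results such as "yl" are still rejected by the 4-character minimum.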


Step 3: Extract Signals From Embedded WHOIS

Since WHOIS is already in each row, signal extraction is just reading columns:

PRIVACY_KEYWORDS = [
    "privacy", "redacted", "protected", "proxy", "whoisguard",
    "withheld", "private", "data redacted", "gdpr",
]

def extract_signals(row):
    reg_name = row.get("registrant_name", "").lower()
    return {
        "create_date":        row.get("create_date", ""),
        "expiry_date":        row.get("expiry_date", ""),
        "registrar":          row.get("domain_registrar_name", ""),
        "registrant_name":    row.get("registrant_name", ""),
        "registrant_country": row.get("registrant_country", ""),
        "ns1":                row.get("name_server_1", ""),
        "status":             row.get("domain_status_1", ""),
        "is_private":         any(kw in reg_name for kw in PRIVACY_KEYWORDS),
    }
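Because is_private is a substring test, it is worth checking how it behaves on a few registrant strings. The sketch below reuses the keyword list from above. The registrant names are made-up samples, and is_private_strict with its COMPANY_PHRASES list is a hypothetical refinement, not something the repository ships.

```python
PRIVACY_KEYWORDS = [
    "privacy", "redacted", "protected", "proxy", "whoisguard",
    "withheld", "private", "data redacted", "gdpr",
]

# Hypothetical list of company-suffix phrases that contain a privacy keyword.
COMPANY_PHRASES = ["private limited", "private ltd"]


def is_private(registrant_name):
    """Same substring check as extract_signals."""
    name = registrant_name.lower()
    return any(kw in name for kw in PRIVACY_KEYWORDS)


def is_private_strict(registrant_name):
    """Drop known company-suffix phrases before the keyword check."""
    name = registrant_name.lower()
    for phrase in COMPANY_PHRASES:
        name = name.replace(phrase, " ")
    return any(kw in name for kw in PRIVACY_KEYWORDS)


for sample in ["REDACTED FOR PRIVACY", "WITHALICE Co., Ltd", "Acme Software Private Limited"]:
    print(f"{sample!r}: plain={is_private(sample)} strict={is_private_strict(sample)}")
```

The third sample shows the trade-off: a company name ending in "Private Limited" trips the plain check and would pick up the 15 privacy points, while the stricter variant treats it as a named registrant.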

Step 4: Risk Scoring

Combine similarity and WHOIS signals into a single risk score. The scale is nominally 0 to 100, but the three signals used here top out at 75:

from datetime import datetime, timezone

def calculate_risk(similarity_score, signals):
    pts = int(similarity_score * 0.4)   # similarity -> max 40 pts

    # Domain age (max 20 pts)
    create_raw = signals.get("create_date", "")
    if create_raw:
        try:
            created  = datetime.strptime(create_raw[:10], "%Y-%m-%d").replace(tzinfo=timezone.utc)
            age_days = (datetime.now(timezone.utc) - created).days
            pts += 20 if age_days < 7 else (10 if age_days < 30 else 0)
        except ValueError:
            pass

    # Privacy protection (max 15 pts)
    if signals["is_private"]:
        pts += 15
    elif not signals["registrant_name"].strip():
        pts += 5

    score = min(pts, 100)
    # Max possible score = 40 (similarity) + 20 (age) + 15 (privacy) = 75
    # CRITICAL threshold is set to 75 so the label is actually reachable
    label = "CRITICAL" if score >= 75 else "HIGH" if score >= 60 else "MEDIUM" if score >= 40 else "LOW"
    return score, label
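To make the arithmetic concrete, here is a small sketch that mirrors the point rules in calculate_risk but takes the domain age as a plain number, so the result does not depend on today's date. risk_points and risk_label are hypothetical helper names, and the age of one day is an assumption chosen to represent a freshly registered domain.

```python
def risk_points(similarity, age_days, is_private, has_registrant_name):
    """Same point rules as calculate_risk, with the age passed in directly."""
    pts = int(similarity * 0.4)                                   # max 40
    pts += 20 if age_days < 7 else (10 if age_days < 30 else 0)  # max 20
    if is_private:
        pts += 15                                                 # max 15
    elif not has_registrant_name:
        pts += 5
    return min(pts, 100)


def risk_label(score):
    return ("CRITICAL" if score >= 75 else
            "HIGH" if score >= 60 else
            "MEDIUM" if score >= 40 else "LOW")


cases = [
    ("100% match, private registrant", 100, 1, True,  True),
    ("100% match, empty registrant",   100, 1, False, False),
    ("100% match, named registrant",   100, 1, False, True),
    ("70% match, named registrant",     70, 1, False, True),
]
for name, sim, age, private, has_name in cases:
    score = risk_points(sim, age, private, has_name)
    print(f"{name}: {score} -> {risk_label(score)}")
```

The four cases score 75, 65, 60, and 48. The last one matters for reading the output: with a similarity cutoff of 70 and every domain less than a week old, the lowest possible score is 28 + 20 = 48, so nothing that passes the cutoff on a fresh feed can land in LOW.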

Putting It All Together

def run(api_key, threshold=70):
    print("Downloading NRD feeds (gTLD + ccTLD)...")
    rows = download_and_merge(api_key)
    print(f"  {len(rows):,} domains loaded.")

    findings = []
    for row in rows:
        domain = row.get("domain_name", "").strip().lower()
        if not domain:
            continue

        score, brand = score_domain(domain)
        if score < threshold:
            continue

        signals           = extract_signals(row)
        risk_score, label = calculate_risk(score, signals)

        findings.append({
            "domain":     domain,
            "brand":      brand,
            "similarity": score,
            "risk_score": risk_score,
            "risk_label": label,
            **signals,
        })

    findings.sort(key=lambda x: x["risk_score"], reverse=True)

    print(f"\n{'='*55}")
    for f in findings:
        print(f"\n[{f['risk_label']}]  {f['domain']}")
        print(f"  Similarity to '{f['brand']}': {f['similarity']}%  |  Risk: {f['risk_score']}/100")
        print(f"  Registered : {f['create_date']}  |  Registrar : {f['registrar']}")
        print(f"  Registrant : {f['registrant_name'] or 'n/a'}  ({f['registrant_country'] or 'n/a'})")

    print(f"\n  {len(findings)} suspicious domains found.")
    return findings

if __name__ == "__main__":
    run(api_key="YOUR_WHOISFREAKS_API_KEY")

Real Results

Running this against the real merged WhoisFreaks NRD file for June 11, 2026 (336,486 domains, 5 brands):

Downloading NRD feeds (gTLD + ccTLD)...
  336,486 domains loaded.

=======================================================

[CRITICAL]  googledocument.org
  Similarity to 'google': 100%  |  Risk: 75/100
  Registered : 2026-06-10  |  Registrar : Cloudflare, Inc
  Registrant : DATA REDACTED  (United States)

[CRITICAL]  account-ads-google.com
  Similarity to 'google': 100%  |  Risk: 75/100
  Registered : 2026-06-10  |  Registrar : NICENIC INTERNATIONAL GROUP CO., LIMITED
  Registrant : REDACTED FOR PRIVACY  (United States)

[CRITICAL]  henryamazon.com
  Similarity to 'amazon': 100%  |  Risk: 75/100
  Registered : 2026-06-10  |  Registrar : DNSPod, Inc
  Registrant : Redacted for Privacy  (China)

[HIGH]  netflixsupport.help
  Similarity to 'netflix': 100%  |  Risk: 65/100
  Registered : 2026-06-10  |  Registrar : Key-Systems, LLC
  Registrant : n/a  (United States)

[HIGH]  amazonsellerpilot.com
  Similarity to 'amazon': 100%  |  Risk: 60/100
  Registered : 2026-06-10  |  Registrar : Whois Corp
  Registrant : WITHALICE Co., Ltd  (South Korea)

  2,912 suspicious domains found.
  CRITICAL: 62   HIGH: 1,197   MEDIUM: 1,653   LOW: 0

2,912 candidates from 336,486 rows, zero separate API calls.
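The per-label summary line at the end of that output is not printed by the condensed run() shown earlier. A minimal sketch of how such a line could be produced from the findings list follows. summarize_labels is a hypothetical helper, and the sample findings are made up.

```python
from collections import Counter

LABELS = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def summarize_labels(findings):
    """Count findings per risk label and format them on one line."""
    counts = Counter(f["risk_label"] for f in findings)
    # Counter returns 0 for labels that never occur, so LOW still prints.
    return "   ".join(f"{label}: {counts[label]:,}" for label in LABELS)


sample_findings = [
    {"domain": "a.example", "risk_label": "CRITICAL"},
    {"domain": "b.example", "risk_label": "HIGH"},
    {"domain": "c.example", "risk_label": "HIGH"},
]
print("  " + summarize_labels(sample_findings))
```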


What I Noticed In The Results

The scale is real. A single day's merged NRD dataset contains 336,000+ domains. Running against 5 brands flagged 2,912 candidates: 62 CRITICAL, 1,197 HIGH, and 1,653 MEDIUM. That is not a theoretical threat surface.

The min-length guard matters. Without it, digit-heavy domains like y1288.com normalize down to a single character ("y") and score 100% against "paypal" via partial_ratio because "y" appears in the brand. A 4-character minimum cuts these false positives cleanly. Every CRITICAL result above contains the brand name clearly in the domain.
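The effect is easy to reproduce without any third-party library. The sketch below is a simplified stand-in for partial matching built on difflib from the standard library. It is not thefuzz's implementation, and best_window_score and guarded_score are hypothetical names. It slides the shorter string across the longer one and keeps the best window.

```python
from difflib import SequenceMatcher


def best_window_score(a, b):
    """Best similarity (0-100) of the shorter string against any
    equally long window of the longer string."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    n = len(short)
    best = 0.0
    for i in range(len(long_) - n + 1):
        window = long_[i:i + n]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(best * 100)


def guarded_score(norm, brand, min_len=4):
    """Apply the minimum-length guard before scoring."""
    if len(norm) < min_len:
        return 0
    return best_window_score(norm, brand)


print(best_window_score("y", "paypal"))   # 100: the one-character window "y" matches
print(guarded_score("y", "paypal"))       # 0: rejected by the length guard
print(guarded_score("paypa", "paypal"))   # 100: a real near-match still scores
```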

CRITICAL means what it says. Because 75 is the maximum score, a domain only reaches CRITICAL when all three signals fire: it was registered within the last 7 days, it has a privacy-protected registrant, and it matches a brand at 100% similarity. googledocument.org and account-ads-google.com look like infrastructure being staged for abuse. henryamazon.com, registered through DNSPod with a privacy-protected registrant listed in China, is a clear watch target.

Privacy protection is everywhere. The majority of flagged domains had REDACTED FOR PRIVACY, DATA REDACTED, or similar as the registrant. Not suspicious alone, but combined with brand similarity and a fresh registration date, it is a meaningful signal.

Some hits are likely false positives. amazonsellerpilot.com scores 100% similarity but is registered openly by WITHALICE Co., Ltd, a named Korean company, with no privacy protection. That is almost certainly a legitimate seller tool, not phishing. Tuning the threshold and normalization logic for your specific brand should cut down on these.

Clustering patterns are visible. Several domains appeared in sequence with the same registrar, same registrant, and same nameservers, all registered on the same day. That pattern points to a registration campaign rather than independent registrations.
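A minimal sketch of how such clusters could be surfaced from the findings list, assuming the dict keys produced by run() above ("registrar", "ns1", "create_date", "domain"). cluster_findings and the min_size default are hypothetical choices, and the sample findings are made up. The registrant is left out of the key because it is redacted for most rows.

```python
from collections import defaultdict


def cluster_findings(findings, min_size=3):
    """Group flagged domains by (registrar, first nameserver, creation day)."""
    groups = defaultdict(list)
    for f in findings:
        key = (
            f.get("registrar", "").strip(),
            f.get("ns1", "").strip().lower(),
            f.get("create_date", "")[:10],
        )
        if not all(key):
            continue  # skip rows where any grouping field is missing
        groups[key].append(f["domain"])
    # Keep only groups large enough to suggest a coordinated batch.
    return {k: v for k, v in groups.items() if len(v) >= min_size}


sample = [
    {"domain": f"brand-login{i}.example", "registrar": "Registrar A",
     "ns1": "ns1.host.example", "create_date": "2026-06-10"}
    for i in range(3)
] + [
    {"domain": "other.example", "registrar": "Registrar B",
     "ns1": "ns1.else.example", "create_date": "2026-06-10"},
]

for key, domains in cluster_findings(sample).items():
    print(key, len(domains), "domains")
```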


Going Further

A few directions worth exploring from here:

  • Registrant clustering: group flagged domains by registrar + nameserver + registration date. Clusters indicate coordinated campaigns.
  • Add your own brand: replace the entries in BRANDS with your actual product names. One short string per brand is all it takes.
  • Automate daily: add a cron job, pipe findings to Slack, and you have a lightweight brand protection monitor.
  • Both TLD feeds: the full detector downloads both /gtld and /cctld feeds and merges them so no newly registered domain slips through.

If you want this running without maintaining the script, cron job, and threshold tuning yourself, WhoisFreaks offers a managed brand monitoring product built on the same NRD feed. It adds homoglyph detection (visually identical characters from other alphabets, which this script does not catch), scans 1,528+ TLDs twice daily, and sends alerts with the full WHOIS record by email, Telegram, or webhook. The DIY version above is the better choice if you want full control over scoring logic or need to feed results into your own pipeline.
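For the "automate daily" idea in the list above, a Slack incoming webhook accepts a JSON body with a "text" field. The sketch below builds that payload from the findings list and posts it with the standard library. build_slack_payload and post_to_slack are hypothetical helpers, the message layout is an arbitrary choice, and the webhook URL is something you would create in your own Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(findings, limit=10):
    """Format the top findings as a plain-text Slack message payload."""
    lines = [f"{len(findings)} suspicious domains found."]
    for f in findings[:limit]:
        lines.append(
            f"[{f['risk_label']}] {f['domain']} "
            f"(brand: {f['brand']}, risk {f['risk_score']}/100)"
        )
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


sample = [{"domain": "brand-login.example", "brand": "brand",
           "risk_score": 75, "risk_label": "CRITICAL"}]
print(build_slack_payload(sample)["text"])
# post_to_slack("YOUR_SLACK_WEBHOOK_URL", build_slack_payload(sample))
```

Since run() already sorts findings by risk score, the limit keeps the message to the highest-risk domains. A cron entry that runs the script once a day after the feed is published would complete the loop.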

Full Source Code

The complete version with proper error handling, config.py, and JSON output is available on GitHub:

github.com/WhoisFreaks/wf-python-detect-phishing-domains

API docs and free signup: whoisfreaks.com/documentation/newly-registered-domains


Questions about the fuzzy matching logic or risk scoring? Happy to discuss in the comments.
