Most phishing detection tools are reactive. They wait for a domain to be reported, then add it to a blocklist. By that point, the campaign has already been running for hours or days.
I wanted to flip that. Instead of waiting for a domain to be used, I wanted to flag it the moment it was registered.
Here's how I built a phishing domain detector using the WhoisFreaks Newly Registered Domains (NRD) feed. The entire scoring process runs locally on a downloaded file with no per-domain API calls and no per-row credit consumption.
The Key Insight: The Data Is Already There
Most WHOIS-based tools work like this:
Download list of domains -> loop through -> call WHOIS API for each one
That's slow and expensive. WhoisFreaks' NRD feed works differently. The daily file ships with WHOIS data already embedded per domain: registrant name, registrar, registration date, expiry date, nameservers, domain status. All 65 columns, one row per domain, one file download.
So the workflow becomes:
Download gzipped CSV -> parse locally -> score everything -> done
No loop of API calls. No per-domain credit burn. You get the full picture in a single download.
What the NRD File Looks Like
The WhoisFreaks NRD "With WHOIS" endpoint returns a gzipped CSV. Each row covers one newly registered domain and includes columns like:
| Column | Example value |
|---|---|
| domain_name | paypa1-secure-verify.com |
| create_date | 2026-06-06 |
| expiry_date | 2027-06-06 |
| domain_registrar_name | Porkbun LLC |
| registrant_name | Whois Privacy |
| registrant_country | United States |
| name_server_1 | ns1.porkbun.com |
| domain_status_1 | clienttransferprohibited |
WhoisFreaks publishes two daily files: a gTLD file (.com, .net, .org, .io etc.) and a ccTLD file covering country-code TLDs like .uk, .de, and .pk. Each row carries the same 65 columns. Merged, the two files typically run 300,000 to 400,000 rows per day. The full detector downloads and merges both.
Endpoints:
# gTLDs (.com .net .org .io etc.)
curl "https://files.whoisfreaks.com/v3.1/download/domainer/gtld?apiKey=YOUR_KEY&whois=true&date=2026-06-07"
# ccTLDs (.uk .de .pk .ca etc.)
curl "https://files.whoisfreaks.com/v3.1/download/domainer/cctld?apiKey=YOUR_KEY&whois=true&date=2026-06-07"
- whois=true requests the With-WHOIS variant
- date is optional; omit it for the most recent available file

Full docs: whoisfreaks.com/documentation/newly-registered-domains
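To make the two query parameters concrete, here is a minimal sketch of how the download URL could be assembled. The helper name build_nrd_url is hypothetical and not part of the WhoisFreaks tooling; the base URL and parameter names are taken from the curl commands above.

```python
from urllib.parse import urlencode

NRD_BASE = "https://files.whoisfreaks.com/v3.1/download/domainer"

def build_nrd_url(api_key, tld_type="gtld", date=None):
    """Return the download URL for one daily NRD file (With WHOIS).

    Hypothetical helper for illustration; it only builds the string and
    makes no network call.
    """
    if tld_type not in ("gtld", "cctld"):
        raise ValueError(f"unknown tld_type: {tld_type!r}")
    params = {"apiKey": api_key, "whois": "true"}
    if date:  # leave the parameter out to get the most recent file
        params["date"] = date  # format: yyyy-MM-dd
    return f"{NRD_BASE}/{tld_type}?{urlencode(params)}"

print(build_nrd_url("YOUR_KEY", "cctld", "2026-06-07"))
```

Leaving date out produces a URL with only apiKey and whois, which matches the "most recent available file" behavior described above.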
Setup
git clone https://github.com/WhoisFreaks/wf-python-detect-phishing-domains.git
cd wf-python-detect-phishing-domains
pip install -r requirements.txt
cp config.example.py config.py
Then open config.py and add your API key and brand names. Get a free key at whoisfreaks.com.
Step 1: Download gTLD + ccTLD Files and Merge
import csv
import gzip
import io

import requests

NRD_BASE = "https://files.whoisfreaks.com/v3.1/download/domainer"

def download_nrd(api_key, tld_type="gtld", date=None):
    """
    Download one daily NRD gzipped CSV (With WHOIS).
    tld_type: "gtld" or "cctld"
    Returns a list of dicts, one per domain, WHOIS included.
    """
    params = {"apiKey": api_key, "whois": "true"}
    if date:
        params["date"] = date  # format: yyyy-MM-dd
    resp = requests.get(f"{NRD_BASE}/{tld_type}", params=params, timeout=300)
    resp.raise_for_status()
    content = gzip.decompress(resp.content).decode("utf-8", errors="replace")
    return list(csv.DictReader(io.StringIO(content)))

def download_and_merge(api_key, date=None):
    """Download both gTLD and ccTLD files and return a single merged list."""
    gtld_rows = download_nrd(api_key, tld_type="gtld", date=date)
    cctld_rows = download_nrd(api_key, tld_type="cctld", date=date)
    # De-duplicate on the lowercased domain name, keeping the first occurrence.
    seen, merged = set(), []
    for row in gtld_rows + cctld_rows:
        domain = row.get("domain_name", "").strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            merged.append(row)
    return merged
Two HTTP requests. Two CSV files merged into one dataset. All WHOIS data is already inside, no further API calls needed.
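The version above holds the compressed bytes, the decoded text, and the parsed rows in memory at the same time. If that becomes a problem with 300,000+ rows, one option is to decompress and parse lazily. The sketch below is a variation, not what the repository does, and the name iter_nrd_rows is hypothetical. It accepts any binary file-like object, so it should also work with resp.raw from requests.get(..., stream=True).

```python
import csv
import gzip
import io

def iter_nrd_rows(gz_stream):
    """Yield one dict per CSV row from a binary gzip stream, decompressing lazily."""
    with gzip.GzipFile(fileobj=gz_stream) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
        yield from csv.DictReader(text)

# Demo with an in-memory file standing in for the real download.
sample = gzip.compress(
    b"domain_name,create_date\n"
    b"example-one.com,2026-06-06\n"
    b"example-two.de,2026-06-06\n"
)
rows = list(iter_nrd_rows(io.BytesIO(sample)))
print(len(rows), rows[0]["domain_name"])
```

Because the function is a generator, the scoring loop can filter rows as they arrive and keep only the ones that pass the similarity threshold.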
Step 2: Fuzzy Similarity Scoring (Local, No API)
The trick is normalization: strip the TLD, separators, digits, and noise words like "secure", "login", "account" before scoring. Otherwise paypa1-secure-verify.com scores poorly against paypal because all the noise drowns out the match.
import re

from thefuzz import fuzz

NOISE_WORDS = {
    "account", "secure", "login", "verify", "support", "update",
    "official", "online", "service", "portal", "help", "center",
    "my", "get", "new", "app", "access", "confirm",
}

def normalize(domain):
    # rsplit strips the TLD generically -- works for all ccTLDs and gTLDs
    # A regex allowlist would miss unlisted TLDs (e.g. .de, .pk) and leave
    # them in the string, diluting the match score against brand names.
    core = domain.lower().rsplit(".", 1)[0]
    parts = re.split(r"[-_.]", core)
    parts = [re.sub(r"\d", "", p) for p in parts]  # strip digits within tokens
    return "".join(p for p in parts if p and p not in NOISE_WORDS)

BRANDS = ["paypal", "amazon", "netflix", "google", "microsoft"]

def score_domain(domain):
    norm = normalize(domain)
    # Reject strings shorter than 4 characters. Digit-heavy domains like
    # "y1288.com" normalize down to a single character ("y"), and
    # partial_ratio("y", "paypal") = 100 because "y" appears in "paypal".
    # A 4-character minimum cuts these false positives cleanly.
    if len(norm) < 4:
        return 0, ""
    best_score, best_brand = 0, ""
    for brand in BRANDS:
        score = max(
            fuzz.ratio(norm, brand),
            fuzz.partial_ratio(norm, brand),
            fuzz.token_sort_ratio(norm, brand),
        )
        if score > best_score:
            best_score, best_brand = score, brand
    return best_score, best_brand
This runs entirely in memory with no network calls and no credits consumed.
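It helps to see what normalization actually produces before any fuzzy scoring happens. The snippet below repeats the normalize logic from above so it runs on its own (standard library only) and prints the result for a few domains. The .co.uk domain is a made-up example.

```python
import re

NOISE_WORDS = {
    "account", "secure", "login", "verify", "support", "update",
    "official", "online", "service", "portal", "help", "center",
    "my", "get", "new", "app", "access", "confirm",
}

def normalize(domain):
    # Same steps as the detector: drop the last label, split on
    # separators, strip digits, drop noise tokens, join what is left.
    core = domain.lower().rsplit(".", 1)[0]
    parts = [re.sub(r"\d", "", p) for p in re.split(r"[-_.]", core)]
    return "".join(p for p in parts if p and p not in NOISE_WORDS)

for d in [
    "paypa1-secure-verify.com",  # -> 'paypa'
    "account-ads-google.com",    # -> 'adsgoogle'
    "netflixsupport.help",       # -> 'netflixsupport'
    "amazon-login.co.uk",        # -> 'amazonco'
    "y1288.com",                 # -> 'y'
]:
    print(f"{d:28} -> {normalize(d)!r}")
```

Two details stand out. Noise words are only removed when a separator isolates them, so "netflixsupport" keeps its suffix and relies on partial_ratio to find the brand. And because only the last label is dropped, a two-part suffix like .co.uk leaves "co" in the string.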
Step 3: Extract Signals From Embedded WHOIS
Since WHOIS is already in each row, signal extraction is just reading columns:
PRIVACY_KEYWORDS = [
    "privacy", "redacted", "protected", "proxy", "whoisguard",
    "withheld", "private", "data redacted", "gdpr",
]

def extract_signals(row):
    # csv.DictReader fills missing trailing fields with None, so fall
    # back to "" before calling string methods on a value.
    def field(key):
        return row.get(key) or ""

    reg_name = field("registrant_name").lower()
    return {
        "create_date": field("create_date"),
        "expiry_date": field("expiry_date"),
        "registrar": field("domain_registrar_name"),
        "registrant_name": field("registrant_name"),
        "registrant_country": field("registrant_country"),
        "ns1": field("name_server_1"),
        "status": field("domain_status_1"),
        "is_private": any(kw in reg_name for kw in PRIVACY_KEYWORDS),
    }
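The privacy check is a plain substring match, which is worth trying on a few registrant strings before trusting it. The sketch below isolates that one check. The helper name is_private_registrant and the "Example Private Limited" value are made up for illustration; the other strings come from this article.

```python
PRIVACY_KEYWORDS = [
    "privacy", "redacted", "protected", "proxy", "whoisguard",
    "withheld", "private", "data redacted", "gdpr",
]

def is_private_registrant(registrant_name):
    """Case-insensitive substring match against the privacy keyword list."""
    name = (registrant_name or "").lower()
    return any(kw in name for kw in PRIVACY_KEYWORDS)

for name in [
    "REDACTED FOR PRIVACY",      # True
    "Whois Privacy",             # True
    "WITHALICE Co., Ltd",        # False
    "",                          # False (handled separately in scoring)
    "Example Private Limited",   # True, although this is a company suffix
]:
    print(f"{name!r:28} -> {is_private_registrant(name)}")
```

The last case shows the trade-off: "private" also appears in ordinary company names, so a stricter match may be worth considering if such registrants show up in your results.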
Step 4: Risk Scoring
Combine similarity and WHOIS signals into a single risk score. The scale is nominally 0 to 100, but the three signals used here add up to at most 75:
from datetime import datetime, timezone

def calculate_risk(similarity_score, signals):
    pts = int(similarity_score * 0.4)  # similarity -> max 40 pts
    # Domain age (max 20 pts)
    create_raw = signals.get("create_date", "")
    if create_raw:
        try:
            created = datetime.strptime(create_raw[:10], "%Y-%m-%d").replace(tzinfo=timezone.utc)
            age_days = (datetime.now(timezone.utc) - created).days
            pts += 20 if age_days < 7 else (10 if age_days < 30 else 0)
        except ValueError:
            pass
    # Privacy protection (max 15 pts)
    if signals["is_private"]:
        pts += 15
    elif not signals["registrant_name"].strip():
        pts += 5
    score = min(pts, 100)
    # Max possible score = 40 (similarity) + 20 (age) + 15 (privacy) = 75
    # CRITICAL threshold is set to 75 so the label is actually reachable
    label = "CRITICAL" if score >= 75 else "HIGH" if score >= 60 else "MEDIUM" if score >= 40 else "LOW"
    return score, label
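To check the arithmetic, here is a small worked example. It restates the same point rules in a hypothetical helper, risk_points, that takes the reference date as an argument instead of reading the clock, so the output is reproducible.

```python
from datetime import date

def risk_points(similarity, created, as_of, is_private, has_registrant_name):
    """Same point rules as calculate_risk, with an explicit reference date."""
    pts = int(similarity * 0.4)                  # similarity: max 40
    age_days = (as_of - created).days
    pts += 20 if age_days < 7 else (10 if age_days < 30 else 0)  # age: max 20
    if is_private:
        pts += 15                                # privacy: max 15
    elif not has_registrant_name:
        pts += 5
    return pts

def risk_label(score):
    return "CRITICAL" if score >= 75 else "HIGH" if score >= 60 else "MEDIUM" if score >= 40 else "LOW"

created, today = date(2026, 6, 10), date(2026, 6, 11)
cases = [
    (100, today, True, True),               # 40 + 20 + 15 = 75 -> CRITICAL
    (100, today, False, False),             # 40 + 20 + 5  = 65 -> HIGH
    (100, today, False, True),              # 40 + 20      = 60 -> HIGH
    (70, today, False, True),               # 28 + 20      = 48 -> MEDIUM
    (100, date(2026, 8, 1), True, True),    # 40 + 0 + 15  = 55 -> MEDIUM
]
for sim, as_of, private, named in cases:
    pts = risk_points(sim, created, as_of, private, named)
    print(sim, as_of, private, named, "->", pts, risk_label(pts))
```

The last case is the one to keep in mind: because age is measured against the current time, re-scoring an old file weeks later lowers every score. Passing the file date as the reference avoids that.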
Putting It All Together
def run(api_key, threshold=70):
    print("Downloading NRD feeds (gTLD + ccTLD)...")
    rows = download_and_merge(api_key)
    print(f" {len(rows):,} domains loaded.")
    findings = []
    for row in rows:
        domain = row.get("domain_name", "").strip().lower()
        if not domain:
            continue
        # Cheap similarity filter first; WHOIS signals only for candidates.
        score, brand = score_domain(domain)
        if score < threshold:
            continue
        signals = extract_signals(row)
        risk_score, label = calculate_risk(score, signals)
        findings.append({
            "domain": domain,
            "brand": brand,
            "similarity": score,
            "risk_score": risk_score,
            "risk_label": label,
            **signals,
        })
    findings.sort(key=lambda x: x["risk_score"], reverse=True)
    print(f"\n{'='*55}")
    for f in findings:
        print(f"\n[{f['risk_label']}] {f['domain']}")
        print(f" Similarity to '{f['brand']}': {f['similarity']}% | Risk: {f['risk_score']}/100")
        print(f" Registered : {f['create_date']} | Registrar : {f['registrar']}")
        print(f" Registrant : {f['registrant_name'] or 'n/a'} ({f['registrant_country'] or 'n/a'})")
    print(f"\n {len(findings)} suspicious domains found.")
    return findings

if __name__ == "__main__":
    run(api_key="YOUR_WHOISFREAKS_API_KEY")
Real Results
Running this against the real merged WhoisFreaks NRD file for June 11, 2026 (336,486 domains, 5 brands):
Downloading NRD feeds (gTLD + ccTLD)...
336,486 domains loaded.
=======================================================
[CRITICAL] googledocument.org
Similarity to 'google': 100% | Risk: 75/100
Registered : 2026-06-10 | Registrar : Cloudflare, Inc
Registrant : DATA REDACTED (United States)
[CRITICAL] account-ads-google.com
Similarity to 'google': 100% | Risk: 75/100
Registered : 2026-06-10 | Registrar : NICENIC INTERNATIONAL GROUP CO., LIMITED
Registrant : REDACTED FOR PRIVACY (United States)
[CRITICAL] henryamazon.com
Similarity to 'amazon': 100% | Risk: 75/100
Registered : 2026-06-10 | Registrar : DNSPod, Inc
Registrant : Redacted for Privacy (China)
[HIGH] netflixsupport.help
Similarity to 'netflix': 100% | Risk: 65/100
Registered : 2026-06-10 | Registrar : Key-Systems, LLC
Registrant : n/a (United States)
[HIGH] amazonsellerpilot.com
Similarity to 'amazon': 100% | Risk: 60/100
Registered : 2026-06-10 | Registrar : Whois Corp
Registrant : WITHALICE Co., Ltd (South Korea)
2,912 suspicious domains found.
CRITICAL: 62 HIGH: 1,197 MEDIUM: 1,653 LOW: 0
2,912 candidates from 336,486 rows, zero separate API calls.
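The run() function shown earlier prints only the total, not the per-label breakdown in the output above. A minimal sketch of how such a summary line could be produced from the findings list follows; the summarize helper is hypothetical, and the three findings in the demo are placeholder data, not real results.

```python
from collections import Counter

LABELS = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

def summarize(findings):
    """Return a one-line count of findings per risk label, in fixed order."""
    counts = Counter(f["risk_label"] for f in findings)
    return " ".join(f"{name}: {counts[name]:,}" for name in LABELS)

demo = [
    {"domain": "a.example", "risk_label": "HIGH"},
    {"domain": "b.example", "risk_label": "HIGH"},
    {"domain": "c.example", "risk_label": "CRITICAL"},
]
print(summarize(demo))
```

Counter returns 0 for labels that never occur, which is why LOW can be printed without a special case.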
What I Noticed In The Results
The scale is real. A single day's merged NRD dataset contains 336,000+ domains. Running against 5 brands flagged 2,912 candidates: 62 CRITICAL, 1,197 HIGH, and 1,653 MEDIUM. That is not a theoretical threat surface.
The min-length guard matters. Without it, digit-heavy domains like y1288.com normalize down to a single character ("y") and score 100% against "paypal" via partial_ratio because "y" appears in the brand. A 4-character minimum cuts these false positives cleanly. Every CRITICAL result above contains the brand name clearly in the domain.
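The effect is easy to reproduce without thefuzz. The sketch below is a standard-library approximation of partial matching, not thefuzz's actual algorithm: it slides the shorter string across the longer one and keeps the best difflib ratio. The function name best_window_ratio is made up for this example.

```python
from difflib import SequenceMatcher

def best_window_ratio(a, b):
    """Best 0-100 score of the shorter string against any equal-length
    window of the longer one (a rough stand-in for partial matching)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    n = len(short)
    best = max(
        SequenceMatcher(None, short, long_[i:i + n]).ratio()
        for i in range(len(long_) - n + 1)
    )
    return round(best * 100)

print(best_window_ratio("y", "paypal"))            # single character: perfect score
print(best_window_ratio("paypa", "paypal"))        # a genuine lookalike: also perfect
print(best_window_ratio("henryamazon", "amazon"))  # brand embedded in a longer name
```

A one-character string matches perfectly as long as that character occurs anywhere in the brand, so the score alone cannot separate "y" from "paypa". The length guard is what tells them apart.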
CRITICAL means what it says. Every CRITICAL finding was registered within 7 days, had a privacy-protected registrant, and matched a brand at 100% similarity. googledocument.org and account-ads-google.com are obvious infrastructure setups. henryamazon.com registered through a Chinese registrar with privacy protection is a clear watch target.
Privacy protection is everywhere. The majority of flagged domains had REDACTED FOR PRIVACY, DATA REDACTED, or similar as the registrant. Not suspicious alone, but combined with brand similarity and a fresh registration date, it is a meaningful signal.
Some are likely false positives. amazonsellerpilot.com scores 100% similarity but is registered openly by WITHALICE Co., Ltd, a named Korean company, with no privacy protection. That is almost certainly a legitimate seller tool, not phishing. Tuning the threshold and normalization logic for your specific brand should reduce these.
Clustering patterns are visible. Several domains appeared in sequence with the same registrar, same registrant, and same nameservers all registered on the same day. That is a registration campaign, not independent registrations.
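Grouping the findings makes these campaigns easier to spot. The following is a minimal sketch, assuming the findings dicts produced by run() above (keys domain, registrar, ns1, create_date). The cluster_findings name, the min_size default, and the sample data are illustrative choices.

```python
from collections import defaultdict

def cluster_findings(findings, min_size=3):
    """Group flagged domains by (registrar, first nameserver, creation date)
    and return groups with at least min_size members, largest first."""
    groups = defaultdict(list)
    for f in findings:
        key = (
            (f.get("registrar") or "").strip().lower(),
            (f.get("ns1") or "").strip().lower(),
            (f.get("create_date") or "")[:10],
        )
        groups[key].append(f["domain"])
    clusters = [(k, v) for k, v in groups.items() if len(v) >= min_size]
    return sorted(clusters, key=lambda kv: len(kv[1]), reverse=True)

# Placeholder findings for the demo, not real data.
sample = [
    {"domain": f"brand-login{i}.example", "registrar": "Registrar A",
     "ns1": "ns1.host.example", "create_date": "2026-06-10"}
    for i in range(3)
] + [
    {"domain": "unrelated.example", "registrar": "Registrar B",
     "ns1": "ns1.other.example", "create_date": "2026-06-10"},
]
for key, domains in cluster_findings(sample):
    print(key, len(domains), domains)
```

Large registrars put many unrelated customers on the same default nameservers, so adding registrant_name to the key, or raising min_size, may be needed to keep the clusters meaningful.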
Going Further
A few directions worth exploring from here:
- Registrant clustering: group flagged domains by registrar + nameserver + registration date. Clusters indicate coordinated campaigns.
- Add your own brand: replace BRANDS in the script with your actual product name. One short string is all it takes.
- Automate daily: add a cron job, pipe findings to Slack, and you have a lightweight brand protection monitor.
- Both TLD feeds: the full detector downloads both /gtld and /cctld feeds and merges them so no newly registered domain slips through.

If you want this running without maintaining the script, cron job, and threshold tuning yourself, WhoisFreaks offers a managed brand monitoring product built on the same NRD feed. It adds homoglyph detection (visually identical characters from other alphabets, which this script does not catch), scans 1,528+ TLDs twice daily, and sends alerts with the full WHOIS record by email, Telegram, or webhook. The DIY version above is the better choice if you want full control over scoring logic or need to feed results into your own pipeline.
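For the "pipe findings to Slack" idea from the list above, the part worth sketching is the message itself. Slack incoming webhooks accept a JSON body with a text field. The build_slack_payload helper, the limit default, and the sample findings below are assumptions for illustration, and the snippet only builds the payload without sending anything.

```python
import json

def build_slack_payload(findings, limit=10):
    """Format the top findings as a Slack incoming-webhook payload."""
    lines = [f"{len(findings)} suspicious domains found."]
    for f in findings[:limit]:
        lines.append(
            f"[{f['risk_label']}] {f['domain']} "
            f"(brand: {f['brand']}, risk {f['risk_score']}/100)"
        )
    if len(findings) > limit:
        lines.append(f"... and {len(findings) - limit} more")
    return {"text": "\n".join(lines)}

# Placeholder findings for the demo, not real data.
sample = [
    {"domain": "brand-login.example", "brand": "paypal", "risk_score": 75, "risk_label": "CRITICAL"},
    {"domain": "brand-help.example", "brand": "amazon", "risk_score": 60, "risk_label": "HIGH"},
]
print(json.dumps(build_slack_payload(sample, limit=1), indent=2))
```

Sending it would then be a single requests.post(webhook_url, json=payload, timeout=10), where webhook_url is the incoming-webhook URL from your own Slack workspace.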
Full Source Code
The complete version with proper error handling, config.py, and JSON output is available on GitHub:
github.com/WhoisFreaks/wf-python-detect-phishing-domains
API docs and free signup: whoisfreaks.com/documentation/newly-registered-domains
Questions about the fuzzy matching logic or risk scoring? Happy to discuss in the comments.