Freemen HOUNGBEDJI

Posted on May 9

I Built a URL Threat Analyzer That Detects Phishing in Real-Time: Here's How It Works

#python #webdev #security

Every day, millions of people click malicious URLs without knowing it. Phishing pages look legitimate. Shortened links hide their destination. Freshly registered domains slip past blockers.
I got tired of copy-pasting URLs into clunky online scanners and waiting 10 seconds for a result — so I built SnifURL.
Live
GitHub

What is SnifURL?

SnifURL is a real-time URL threat analyzer. You paste a URL, it runs it through 13 heuristic checks (the network-bound ones in parallel), and returns a risk score from 0 to 100 with a full breakdown of every signal it found.
No black box. No "we flagged it, trust us." Every point added or subtracted is explained.

{
  "url": "https://paypa1-secure-login.tk/verify",
  "score": 91,
  "risk_level": "CRITICAL",
  "recommendation": "Phishing almost certain — block immediately",
  "details": [
    "Suspicious TLD (.tk — Freenom free domain) (+18)",
    "Brand impersonation detected: 'paypa1' looks like 'paypal' (+25)",
    "Domain registered 3 days ago (+20)",
    "Homograph character detected: '1' replacing 'l' (+15)",
    "No valid SSL certificate (+13)"
  ]
}

The Risk Score System

🟢 0–14: SAFE. No significant indicators. The URL looks clean.
🟡 15–34: LOW. Probably safe, but worth a second look before clicking.
🟠 35–54: MEDIUM. Suspicious. Inspect manually before trusting it.
🔴 55–74: HIGH. Very suspicious. Block it unless you're 100% sure of the source.
🚨 75–100: CRITICAL. Phishing almost certain. Block immediately.
The score is additive: each indicator adds or subtracts points based on its confidence weight. This makes results explainable and debuggable.
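
To make the bands concrete, here is a minimal sketch of how a score could be mapped to a risk level. The thresholds come from the table above; the function body is an illustrative guess, not SnifURL's actual `get_level`. The last lines add up the five point values listed in the example response.

```python
# Band floors copied from the risk table; the lookup logic is an assumption.
RISK_BANDS = [
    (75, "CRITICAL"),
    (55, "HIGH"),
    (35, "MEDIUM"),
    (15, "LOW"),
    (0, "SAFE"),
]


def get_level(score: int) -> str:
    """Map a score to its risk level, clamping out-of-range input to 0-100."""
    score = max(0, min(100, score))
    for floor, label in RISK_BANDS:
        if score >= floor:
            return label
    return "SAFE"


# Point values listed in the example response for paypa1-secure-login.tk
example_points = [18, 25, 20, 15, 13]
total = sum(example_points)
print(total, get_level(total))  # 91 CRITICAL
```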

The 13 Heuristic Indicators

Here's exactly what SnifURL checks — and why each one matters.
1. TLD Reputation
Free TLDs like .tk, .ml, .cf, .ga, .gq (Freenom) are massively over-represented in phishing campaigns. Cheap generic TLDs like .xyz and .top also rank high. High-risk ccTLDs are weighted accordingly.
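
A rough sketch of what a TLD lookup could look like, using only the TLDs named above. The set names, the return labels, and the function itself are assumptions for illustration; the real `check_tld` may differ.

```python
from typing import Optional
from urllib.parse import urlsplit

# TLDs mentioned in this section; a real list would be longer.
FREENOM_TLDS = {"tk", "ml", "cf", "ga", "gq"}
OTHER_RISKY_TLDS = {"xyz", "top"}


def check_tld(url: str) -> Optional[str]:
    """Return a label for a risky TLD, or None if the TLD is not listed."""
    host = urlsplit(url).hostname or ""
    if "." not in host:
        return None
    tld = host.rsplit(".", 1)[-1]
    if tld in FREENOM_TLDS:
        return "freenom"
    if tld in OTHER_RISKY_TLDS:
        return "risky"
    return None
```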

2. Direct IP in URL
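http://192.168.1.1/login: legitimate public services almost never ask you to log in via a raw IP address (router admin pages on private ranges like this one are the usual exception). A raw IP in a link you received is a strong phishing signal.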
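
One way to detect a raw IP host with the standard library, as a sketch. The function name is hypothetical, and integer-encoded or hex-encoded hosts are not handled here.

```python
import ipaddress
from urllib.parse import urlsplit


def has_ip_host(url: str) -> bool:
    """True when the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so treat it as a domain name
        return False
    return True
```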

3. Brand Impersonation
The engine scans the subdomain and path against a dictionary of major brands (paypal, google, apple, amazon, microsoft, netflix...) and checks for typosquatting variations.
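
A simplified sketch of typosquatting detection using `difflib` from the standard library. The brand list is the one quoted above; the tokenization, the 0.8 similarity threshold, and the function name are assumptions, and SnifURL's real matcher may work differently.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple
from urllib.parse import urlsplit

BRANDS = ["paypal", "google", "apple", "amazon", "microsoft", "netflix"]


def find_lookalike(url: str, threshold: float = 0.8) -> Optional[Tuple[str, str]]:
    """Return (token, brand) for the first near-miss of a brand name, else None."""
    parts = urlsplit(url)
    text = f"{parts.hostname or ''}{parts.path}".lower()
    # Split the host and path into word-like tokens
    for token in re.split(r"[^a-z0-9]+", text):
        if not token:
            continue
        for brand in BRANDS:
            if token == brand:
                continue  # exact mentions need a separate rule
            if SequenceMatcher(None, token, brand).ratio() >= threshold:
                return token, brand
    return None
```

With this threshold, `paypa1` matches `paypal` (5 of 6 characters line up), while unrelated tokens such as `secure` or `login` do not.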

4. Homograph Attacks
Unicode lookalike characters are a sneaky attack vector. pаypal.com with a Cyrillic а looks identical to paypal.com. SnifURL detects non-ASCII characters and flags them.
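
A minimal sketch of the non-ASCII check, using `unicodedata` to name the offending characters so the report is explainable. The function name is hypothetical. Hosts that arrive already punycoded (`xn--...`) would need a separate check.

```python
import unicodedata
from typing import List, Tuple
from urllib.parse import urlsplit


def non_ascii_chars(url: str) -> List[Tuple[str, str]]:
    """List (character, Unicode name) for every non-ASCII character in the host."""
    host = urlsplit(url).hostname or ""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in host if ord(ch) > 127]


# \u0430 is the Cyrillic letter that renders like a Latin "a"
print(non_ascii_chars("https://p\u0430ypal.com/"))
```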

5. SSL Certificate Validity
Checks if the certificate is valid and evaluates issuer trust. A self-signed cert on a "bank login" page? Red flag.
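
A sketch of a certificate check built on Python's `ssl` module. The function names and the returned dictionary shape are assumptions (the shape loosely mirrors the `ssl_certificate` field in the API response shown later). With the default context, a self-signed or expired certificate makes the handshake raise, which is reported as invalid.

```python
import socket
import ssl
from typing import Optional


def fetch_cert(hostname: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect over TLS and return the peer certificate once it has been verified."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def issuer_org(cert: dict) -> Optional[str]:
    """Extract the issuer's organizationName from a getpeercert() dictionary."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def check_ssl(hostname: str) -> dict:
    try:
        cert = fetch_cert(hostname)
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        return {"valid": False, "error": type(exc).__name__}
    return {"valid": True, "issuer_org": issuer_org(cert)}
```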

6. WHOIS Domain Age
New domains are suspicious. A domain registered 2 days ago asking for your password is a massive red flag. SnifURL queries WHOIS and penalizes recently created domains.
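
A sketch of the age calculation that would follow the WHOIS lookup. It assumes the creation date arrives as a `datetime`, a list of them, or `None` (my understanding is that python-whois can return any of these, so treat that as an assumption). The 30-day cutoff comes from the comment in `scoring.py`; the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def _as_utc(dt: datetime) -> datetime:
    """Treat naive datetimes as UTC so they can be compared safely."""
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def domain_age_days(creation_date, now: Optional[datetime] = None) -> Optional[int]:
    """Days since the earliest creation date, or None if no date is available."""
    dates = creation_date if isinstance(creation_date, list) else [creation_date]
    dates = [_as_utc(d) for d in dates if isinstance(d, datetime)]
    if not dates:
        return None
    now = now or datetime.now(timezone.utc)
    return (now - min(dates)).days


def is_recently_registered(age_days: Optional[int], threshold_days: int = 30) -> bool:
    return age_days is not None and age_days < threshold_days
```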

7. DNS Resolution
If the domain doesn't even resolve — it's either dead or a trap. Simple check, non-zero value.
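
A minimal resolution check using the standard library's `socket.getaddrinfo`. SnifURL itself lists dnspython in its stack, so this is a stand-in technique, not the project's code. The `resolver` parameter is an assumption added so the function can be exercised without network access.

```python
import socket


def dns_resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """True when the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: hostname is not valid IDNA
        return False
```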

8. URL Shorteners
bit.ly, tinyurl.com, t.co: shorteners hide the real destination. SnifURL flags them and encourages redirect inspection.
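
As a sketch, shortener detection can be a set lookup on the host. Only the three services named above are listed; the set and function names are hypothetical.

```python
from urllib.parse import urlsplit

SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def is_shortener(url: str) -> bool:
    """True when the URL's host is a known link shortener."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENERS
```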

9. Double File Extensions
invoice.pdf.exe is a classic malware trick. The engine scans the URL path for chained extensions.
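
A sketch of a chained-extension check with a regular expression. The two extension lists (harmless-looking first, executable second) are my own assumptions and would need tuning.

```python
import re
from urllib.parse import unquote, urlsplit

# A document-like extension immediately followed by an executable one
DOUBLE_EXT = re.compile(
    r"\.(pdf|docx?|xlsx?|jpe?g|png|txt)\.(exe|scr|bat|cmd|js|vbs)$",
    re.IGNORECASE,
)


def has_double_extension(url: str) -> bool:
    """True when the last path segment ends in a chained extension."""
    path = unquote(urlsplit(url).path)
    filename = path.rsplit("/", 1)[-1]
    return bool(DOUBLE_EXT.search(filename))
```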

10. @ Character in URL
http://legitimate.com@evil.com/phish: browsers treat everything before the @ as login credentials and connect to the host after it, here evil.com. This old trick still catches people off guard.
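
Python's URL parser splits the example the same way, which makes the trick easy to demonstrate. The function name and return shape below are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def userinfo_trick(url: str) -> Optional[dict]:
    """Return the decoy and the real host when the URL carries a userinfo part."""
    parts = urlsplit(url)
    if parts.username is None:
        return None
    return {"decoy": parts.username, "real_host": parts.hostname}


print(userinfo_trick("http://legitimate.com@evil.com/phish"))
# {'decoy': 'legitimate.com', 'real_host': 'evil.com'}
```

An @ that appears only in the path or query is not flagged by this sketch, since it does not change the destination host.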

11. Non-Standard Ports
https://mybank.com:8080/login is a sign something is off. Legitimate public sites almost always serve HTTPS on the default port 443.
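
A short sketch of the port check. It treats 80 for http and 443 for https as standard; the function name and the choice to flag unparseable ports are assumptions.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def has_non_standard_port(url: str) -> bool:
    """True when the URL names an explicit port other than the scheme's default."""
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:
        # Non-numeric or out-of-range port: treat as suspicious
        return True
    if port is None:
        return False
    return port != DEFAULT_PORTS.get(parts.scheme)
```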

12. Excessive URL Encoding
%68%74%74%70%73%3A%2F%2F... — heavy encoding often signals an attempt to obfuscate a malicious destination.
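
One way to quantify "heavy encoding" is the share of the URL made up of percent-escapes. This is a sketch; the 0.3 threshold is an arbitrary assumption and not SnifURL's actual cutoff.

```python
import re

PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def encoded_ratio(url: str) -> float:
    """Fraction of the URL's characters that belong to %XX escapes."""
    if not url:
        return 0.0
    return 3 * len(PERCENT_ESCAPE.findall(url)) / len(url)


def has_excessive_encoding(url: str, threshold: float = 0.3) -> bool:
    return encoded_ratio(url) >= threshold
```

A URL with a single encoded space stays well under the threshold, while a fully escaped string like the one above scores 1.0.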

13. Subdomain Depth & Hyphens
secure-login-verify.paypal.accounts.malicious.com — deep subdomains and excessive hyphens are classic phishing patterns.
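
A naive sketch of the depth and hyphen counts. It assumes the registrable domain is the last two labels, which is wrong for suffixes like .co.uk; a real implementation would need the Public Suffix List. The function name and return shape are hypothetical.

```python
from urllib.parse import urlsplit


def subdomain_stats(url: str) -> dict:
    """Count subdomain labels (beyond the last two) and hyphens in the host."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    return {"depth": max(0, len(labels) - 2), "hyphens": host.count("-")}


print(subdomain_stats("https://secure-login-verify.paypal.accounts.malicious.com/"))
# {'depth': 3, 'hyphens': 2}
```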

How the Parallel Analysis Works

Network checks (DNS, WHOIS, SSL) are the slow part. Running them sequentially would add 3–5 seconds of latency. Instead, SnifURL runs them concurrently with Python's ThreadPoolExecutor:
# analyseur_url.py (simplified)
from concurrent.futures import ThreadPoolExecutor, as_completed

import scoring  # scoring.py, covered in the next section

# The check_* helpers are omitted from this simplified excerpt.

def analyze(url: str) -> dict:
    results = {}

    # Network-bound checks (slow): run them concurrently, one thread each
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {
            executor.submit(check_ssl, url): "ssl",
            executor.submit(check_whois, url): "whois",
            executor.submit(check_dns, url): "dns",
            executor.submit(check_redirects, url): "redirects",
        }
        # Collect each result as soon as its check finishes
        for future in as_completed(futures):
            key = futures[future]
            results[key] = future.result()

    # Heuristic checks (no I/O, instant)
    results["tld"] = check_tld(url)
    results["homograph"] = check_homograph(url)
    results["brand"] = check_brand_impersonation(url)
    # ...

    return scoring.compute(results)

Total analysis time on most URLs: under 2 seconds.
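
To see the effect of running checks concurrently, here is a self-contained sketch with stand-in checks that just sleep. Everything in it (`slow_check`, `failing_check`, `run_parallel`) is hypothetical. It also catches exceptions per check so that one failing lookup does not abort the whole analysis, which is a design choice of this sketch and not necessarily what SnifURL does.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_check(name: str, delay: float) -> str:
    """Stand-in for a network check: sleeps, then reports success."""
    time.sleep(delay)
    return f"{name} ok"


def failing_check() -> str:
    raise TimeoutError("whois server did not answer")


def run_parallel(checks: dict) -> dict:
    """Run {key: (function, args)} concurrently and collect results by key."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as executor:
        futures = {
            executor.submit(fn, *args): key for key, (fn, args) in checks.items()
        }
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                # Record the failure instead of letting it propagate
                results[key] = {"error": type(exc).__name__}
    return results


checks = {
    "ssl": (slow_check, ("ssl", 0.3)),
    "dns": (slow_check, ("dns", 0.3)),
    "whois": (failing_check, ()),
}
start = time.perf_counter()
results = run_parallel(checks)
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # roughly 0.3s here, not the 0.6s a sequential run would take
```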

The Scoring Engine

scoring.py takes the raw results and computes the final score. Each signal has a weight based on its phishing correlation strength:
# scoring.py (simplified)
# Points added when a signal fires; a higher weight means a stronger phishing correlation.
WEIGHTS = {
    "suspicious_tld": 18,
    "brand_impersonation": 25,
    "homograph": 15,
    "no_ssl": 13,
    "recently_registered": 20,  # < 30 days
    "direct_ip": 22,
    "url_shortener": 10,
    "double_extension": 15,
    "at_char": 12,
    "non_standard_port": 8,
    "excessive_encoding": 10,
    "subdomain_depth": 8,
}

# DESCRIPTIONS (signal key -> human-readable text) and get_level(score)
# are omitted from this simplified excerpt.

def compute(results: dict) -> dict:
    score = 50  # neutral start
    details = []

    for key, weight in WEIGHTS.items():
        if results.get(key):
            score += weight
            details.append(f"{DESCRIPTIONS[key]} (+{weight})")

    # Legitimate signals reduce the score
    if results.get("ssl_valid"):
        score -= 10
    # "or 0" guards against a None age when the WHOIS lookup returned nothing
    if (results.get("domain_age_days") or 0) > 365:
        score -= 12
    # ...

    # Clamp to the 0-100 range
    score = max(0, min(100, score))
    return {"score": score, "risk_level": get_level(score), "details": details}

The weights were calibrated manually against a dataset of known phishing URLs from PhishTank and legitimate domains. It's not ML — it's deliberate, transparent logic.

The API

Everything is accessible via a simple REST API:
curl -X POST https://snifurl.online/analyze \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

{
  "url": "https://example.com",
  "score": 12,
  "risk_level": "SAFE",
  "recommendation": "LEGITIMATE — No significant indicators",
  "details": ["Known legitimate root domain (example.com) (-12)"],
  "indicators": {
    "uses_https": true,
    "dns_exists": true,
    "has_ip": false,
    "suspicious_tld": null,
    "recently_created": false,
    "ssl_certificate": { "valid": true, "issuer_org": "DigiCert" },
    "whois": { "age_days": 9862, "registrar": "..." }
  }
}

You can integrate this into your own app, browser extension, or Slack bot to flag links before users click them.
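
As a starting point for such an integration, here is a small client sketch using only the standard library. The endpoint and request body match the curl call above; the function names and the blocking threshold of 55 (the floor of the HIGH band) are assumptions.

```python
import json
import urllib.request

API_URL = "https://snifurl.online/analyze"


def build_request(url_to_check: str) -> urllib.request.Request:
    """Build the POST request with a JSON body, as in the curl example."""
    body = json.dumps({"url": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(url_to_check: str, timeout: float = 10.0) -> dict:
    """Send the URL to the API and return the parsed JSON report."""
    with urllib.request.urlopen(build_request(url_to_check), timeout=timeout) as resp:
        return json.load(resp)


def should_block(report: dict, threshold: int = 55) -> bool:
    """Decide whether to block a link based on the report's score."""
    return report.get("score", 0) >= threshold
```

A Slack bot or browser extension backend could call `analyze(link)` and warn the user whenever `should_block` returns True.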

Stack & Deployment

Backend → Python 3.11 / Flask
Analysis engine → analyseur_url.py + scoring.py
Network checks → dnspython, python-whois, ssl (parallel execution)
Frontend → Vanilla HTML / CSS / JS
Server → Vultr VPS / Ubuntu 22.04 / Nginx + Gunicorn

No framework overhead on the frontend. The UI is intentionally lean — the value is in the analysis, not the interface.

Run It Yourself

git clone https://github.com/FreemenTech/Snifurl
cd Snifurl
pip install -r requirements.txt
python app.py

Open http://localhost:5000 — that's it.

What's Next

A few things I'm thinking about:

Browser extension — flag URLs inline before you click
Bulk analysis endpoint — scan a list of URLs in one request
ML scoring layer — train a model on PhishTank data to complement the heuristics
Redirect chain analysis — follow shorteners and score the final destination

Try It

🔗Website
Paste a suspicious URL you've received recently. I'd be curious what score it gets. Drop it in the comments.
And if you find a false positive or a phishing URL that slips through, open an issue. That's how the weights get better.

Built by Freemen HOUNGBEDJI — MIT License
