Detecting Phishing Patterns in URLs Using Python: A Practical Approach
Phishing remains a prevalent cybersecurity threat, leveraging malicious URLs to deceive users and steal sensitive information. For security researchers and developers, effective detection techniques are crucial. In this article, we'll explore how to identify common phishing patterns in URLs using Python, focusing on pattern recognition and heuristic analysis.
Understanding Phishing URL Patterns
Phishers often craft URLs that mimic legitimate domains with subtle obfuscations, including:
- Use of subdomains (e.g., login.bank.com.evil.com)
- URL encoding to hide malicious parts
- Homograph attacks with Unicode characters
- Excessive URL length and suspicious parameters
By analyzing these traits, our detection script can flag potentially malicious URLs.
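To make the URL-encoding trait concrete, here is a minimal sketch that peels percent-encoding layers with the standard library's urllib.parse.unquote. The helper name fully_unquote and the max_rounds cap are illustrative assumptions, not part of the detection script in this article.

```python
from urllib.parse import unquote

def fully_unquote(text, max_rounds=5):
    """Repeatedly percent-decode text; return (decoded_text, rounds_needed).

    Hypothetical helper: max_rounds is an arbitrary safety cap.
    """
    rounds = 0
    while rounds < max_rounds:
        decoded = unquote(text)
        if decoded == text:  # nothing left to decode
            break
        text = decoded
        rounds += 1
    return text, rounds

# '%2561' is a doubly encoded 'a': '%25' decodes to '%', leaving '%61'.
print(fully_unquote('/%2561dmin'))  # ('/admin', 2)
print(fully_unquote('/login'))      # ('/login', 0)
```

A round count above one could plausibly serve as an extra heuristic signal, since ordinary links rarely need to be decoded twice; treat that as an assumption to validate against your own data.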
Implementation Strategy
In this example, we'll use Python's built-in re module for pattern matching and the third-party tldextract package (installable with pip install tldextract) for domain parsing. Our goal is to create a heuristic-based system that detects common phishing signs.
Step 1: Extract Domain and Subdomains
import tldextract

def extract_domain_info(url):
    # tldextract splits the host using the Public Suffix List
    extracted = tldextract.extract(url)
    domain = extracted.domain
    subdomain = extracted.subdomain
    tld = extracted.suffix
    return domain, subdomain, tld
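To see why a Public Suffix List-aware parser such as tldextract is preferable to simple string splitting, consider this stdlib-only sketch. naive_registered_domain is a hypothetical helper that just keeps the last two host labels; it is shown as a counterexample, not a recommendation.

```python
from urllib.parse import urlsplit

def naive_registered_domain(url):
    """Hypothetical helper: keep only the last two labels of the host."""
    host = urlsplit(url).hostname or ''
    return '.'.join(host.split('.')[-2:])

print(naive_registered_domain('http://login.bank.com.evil.com'))   # evil.com
print(naive_registered_domain('https://shop.example.co.uk/cart'))  # co.uk (wrong)
```

The first result shows that the site actually being visited is evil.com, however trustworthy the leading labels look. The second shows where the naive split breaks: co.uk is a public suffix, so the registered domain should be example.co.uk.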
Step 2: Detect Suspicious Subdomain Patterns
import re

def has_suspicious_subdomains(subdomain):
    # Checks for multiple nested subdomains or long subdomain labels
    if not subdomain:
        return False
    parts = subdomain.split('.')
    if len(parts) > 3:
        return True
    for part in parts:
        # Detect long or encoded subdomains
        if len(part) > 20 or re.search(r'%[0-9A-Fa-f]{2}', part):
            return True
    return False
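Note that a subdomain such as login.bank.com has only three short labels, so has_suspicious_subdomains does not flag it. One possible complement, sketched here under assumptions, is to look for TLD-like labels inside the subdomain. The COMMON_TLD_LABELS set and the function name are illustrative; a real list would need to be far larger.

```python
# Illustrative, deliberately tiny set; not a complete list of suffixes.
COMMON_TLD_LABELS = {'com', 'net', 'org', 'co', 'gov', 'edu'}

def subdomain_embeds_tld(subdomain):
    """Return True if any subdomain label looks like a top-level domain."""
    if not subdomain:
        return False
    labels = subdomain.lower().split('.')
    return any(label in COMMON_TLD_LABELS for label in labels)

print(subdomain_embeds_tld('login.bank.com'))  # True
print(subdomain_embeds_tld('secure'))          # False
```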
Step 3: Look for URL Obfuscations and Unicode Homographs
def contains_homograph(url):
    # Check for non-Latin characters that resemble Latin letters
    suspicious_chars = [
        'а',  # Cyrillic 'а', resembles Latin 'a'
        'α',  # Greek alpha, resembles Latin 'a'
        'ӏ',  # Cyrillic palochka, resembles Latin 'l'
    ]
    return any(ch in url for ch in suspicious_chars)
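A fixed list of three characters misses most look-alikes. A broader sketch, using only the standard library, could flag host labels that mix scripts or arrive Punycode-encoded. The function names are hypothetical, and taking the first word of unicodedata.name as the script is a rough approximation, not a full confusables check.

```python
import unicodedata

def label_scripts(label):
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch, 'UNKNOWN').split(' ')[0])
    return scripts

def host_looks_spoofed(host):
    """Flag Punycode labels ('xn--' prefix) or labels that mix scripts."""
    for label in host.lower().split('.'):
        if label.startswith('xn--'):
            return True
        if len(label_scripts(label)) > 1:
            return True
    return False

print(host_looks_spoofed('p\u0430ypal.com'))    # True: Cyrillic letter among Latin ones
print(host_looks_spoofed('xn--pypi-7qa.com'))   # True: Punycode label
print(host_looks_spoofed('secure.paypal.com'))  # False
```

Legitimate internationalized domains also use Punycode, so this rule would produce false positives and is better treated as a signal to weigh than as a verdict.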
Step 4: Generate a Detection Heuristic
def is_phishing_url(url):
    domain, subdomain, tld = extract_domain_info(url)
    if has_suspicious_subdomains(subdomain):
        return True
    if contains_homograph(url):
        return True
    # Flag anything not served over HTTPS
    if not url.lower().startswith('https://'):
        return True
    # Excessive URL length
    if len(url) > 100:
        return True
    # Percent-encoded bytes in the query string
    if re.search(r'\?[^#]*%[0-9A-Fa-f]{2}', url):
        return True
    return False
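Because is_phishing_url returns a bare boolean, it is hard to tell which rule fired. One possible variation, sketched with the standard library only, collects the triggered rules instead. The function name, rule labels, and the max_length default are assumptions for illustration; the subdomain and homograph checks are left out to keep the block self-contained.

```python
import re
from urllib.parse import urlsplit

def phishing_reasons(url, max_length=100):
    """Return a list of heuristic rule names that the URL triggers."""
    reasons = []
    parts = urlsplit(url)
    if parts.scheme != 'https':
        reasons.append('no_https')
    if len(url) > max_length:
        reasons.append('long_url')
    # Percent-encoded bytes anywhere in the query string
    if re.search(r'%[0-9A-Fa-f]{2}', parts.query):
        reasons.append('encoded_query')
    return reasons

print(phishing_reasons('http://legit-site.com/login?session=%3A%3Aexpire%3A%3A'))
# ['no_https', 'encoded_query']
print(phishing_reasons('https://secure.paypal.com'))
# []
```

An empty list plays the role of False, and a non-empty list tells an analyst why the URL was flagged.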
Practical Usage
Let's test our heuristic with some examples:
test_urls = [
    'http://login.bank.com.evil.com',
    'https://secure.paypal.com',
    'http://xn--pypi-7qa.com',  # Punycode (IDN) host; flagged by the HTTPS rule, not contains_homograph()
    'http://verylongsubdomainname.example.com',
    'http://update.ҳѵ.com',  # Cyrillic letters outside suspicious_chars; flagged by the HTTPS rule
    'http://legit-site.com/login?session=%3A%3Aexpire%3A%3A',
]

for url in test_urls:
    result = is_phishing_url(url)
    print(f"URL: {url} -> Phishing: {result}")
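This simple heuristic provides a foundational approach to phishing detection based on pattern recognition. Note that the HTTPS rule alone flags every http:// URL, so five of the six test URLs return True regardless of the other checks; only https://secure.paypal.com passes. Combining these techniques with machine learning models can further improve accuracy.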
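If these heuristics are later fed to a machine learning model, each rule can become a numeric feature rather than a hard verdict. The sketch below is one hypothetical way to do that with the standard library; the feature names are assumptions, and no particular model or library is implied.

```python
from urllib.parse import urlsplit

def url_features(url):
    """Map a URL to a small dict of numeric features (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    return {
        'length': len(url),
        'uses_https': int(parts.scheme == 'https'),
        'host_labels': len(host.split('.')) if host else 0,
        'percent_escapes': url.count('%'),
        'non_ascii_chars': sum(1 for ch in url if ord(ch) > 127),
    }

print(url_features('http://login.bank.com.evil.com'))
```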
Conclusion
Developing a Python-based pattern detection system for phishing URLs requires understanding common obfuscation tactics. By incorporating domain analysis, subdomain scrutiny, Unicode checks, and heuristic rules, security professionals can bolster their defensive toolkit. Continuous updates and integration with real-time detection systems are essential for keeping pace with evolving threats.
Note: Always complement heuristic detection with comprehensive security measures, including user education and multi-layered defenses.
This code and methodology serve as a starting point for building more robust phishing detection tools.