Mohammad Waseem

Detecting Phishing Patterns in URLs Using Python: A Practical Approach

Phishing remains a prevalent cybersecurity threat that relies on malicious URLs to deceive users and steal sensitive information. For security researchers and developers, building effective detection techniques is crucial. In this article, we'll explore how to identify common phishing patterns in URLs using Python, focusing on pattern recognition and heuristic analysis.

Understanding Phishing URL Patterns

Phishers often craft URLs that mimic legitimate domains with subtle obfuscations, including:

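  • Deceptive subdomains that put a trusted name in front of the attacker's domain (e.g., login.bank.com.evil.com)
  • URL (percent) encoding to hide malicious parts
  • Homograph attacks that use Unicode characters resembling Latin letters
  • Excessive URL length and suspicious query parameters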

By analyzing these traits, our detection script can flag potentially malicious URLs.
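
As a small illustration of the URL-encoding trait, the sketch below percent-decodes a URL repeatedly so that doubly encoded segments become readable. The helper name fully_unquote and the max_rounds limit are assumptions made for this example; they are not part of the detection script built in the following steps.

```python
from urllib.parse import unquote

def fully_unquote(url, max_rounds=5):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break  # nothing left to decode
        url = decoded
    return url

# '%25' decodes to '%', so '%2561' becomes '%61' and then 'a'
print(fully_unquote('http://example.com/%2561dmin'))
# -> http://example.com/admin
```

Comparing the decoded form with the original is one way to see what an encoded URL is hiding before applying other checks.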

Implementation Strategy

In this example, we'll use the re module from Python's standard library for pattern matching and the third-party tldextract package (installable with pip install tldextract) for domain parsing. Our goal is a heuristic-based system that detects common phishing signs.

Step 1: Extract Domain and Subdomains

import tldextract

def extract_domain_info(url):
    extracted = tldextract.extract(url)
    domain = extracted.domain
    subdomain = extracted.subdomain
    tld = extracted.suffix
    return domain, subdomain, tld
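
To see why this split matters, here is a standard-library stand-in that returns the same three parts in the same order. It is only a sketch: naive_domain_split is a hypothetical helper that assumes the suffix is always the last label, which is exactly the assumption tldextract avoids by consulting the Public Suffix List.

```python
from urllib.parse import urlsplit

def naive_domain_split(url):
    """Return (domain, subdomain, tld), assuming a single-label suffix."""
    host = urlsplit(url).hostname or ''
    labels = host.split('.')
    if len(labels) < 2:
        return host, '', ''
    return labels[-2], '.'.join(labels[:-2]), labels[-1]

# The trusted-looking name ends up in the subdomain, not the domain
print(naive_domain_split('http://login.bank.com.evil.com/path'))
# -> ('evil', 'login.bank.com', 'com')

# The single-label assumption breaks on multi-part suffixes
print(naive_domain_split('https://example.co.uk'))
# -> ('co', 'example', 'uk')
```

The second call shows the limitation: for suffixes such as co.uk the naive version reports the wrong registered domain, so it should not replace the tldextract-based function above.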

Step 2: Detect Suspicious Subdomain Patterns

import re

def has_suspicious_subdomains(subdomain):
    # Checks for multiple nested subdomains or long subdomain labels
    if not subdomain:
        return False
    parts = subdomain.split('.')
    if len(parts) > 3:
        return True
    for part in parts:
        # Detect long or encoded subdomains
        if len(part) > 20 or re.search(r'%[0-9A-Fa-f]{2}', part):
            return True
    return False
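
The check above looks at depth and label length, so a subdomain such as login.bank.com (three short labels) passes it. One possible complement, sketched here under assumptions, is to look for a label that is itself a common top-level domain, which suggests a whole domain name has been embedded in the subdomain. The helper name and the short label set are illustrative only.

```python
# Illustrative sample, not a complete list of top-level domains
COMMON_TLD_LABELS = {'com', 'net', 'org', 'gov', 'edu'}

def embeds_domain_like_label(subdomain, tld_labels=COMMON_TLD_LABELS):
    """True if any subdomain label is itself a common TLD string."""
    if not subdomain:
        return False
    return any(label.lower() in tld_labels for label in subdomain.split('.'))

print(embeds_domain_like_label('login.bank.com'))  # -> True
print(embeds_domain_like_label('secure'))          # -> False
```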

Step 3: Look for URL Obfuscations and Unicode Homographs

def contains_homograph(url):
    # Check for a small sample of Unicode characters that resemble Latin letters
    suspicious_chars = [
        'а',  # Cyrillic 'a'
        'α',  # Greek alpha
        'ӏ'   # Cyrillic palochka, resembles Latin 'l'
    ]
    for ch in suspicious_chars:
        if ch in url:
            return True
    return False
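
A fixed character list only catches the lookalikes it names. A more general sketch, offered here as an assumption rather than as part of the original script, is to decode any Punycode labels and then flag hostname labels that mix writing systems. It approximates the script of a character by the first word of its Unicode name; all three helper names are hypothetical.

```python
import unicodedata

def scripts_used(text):
    """Approximate the scripts in text via the first word of each Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, '')
            if name:
                scripts.add(name.split(' ')[0])
    return scripts

def decode_punycode_labels(host):
    """Decode 'xn--' labels with the built-in punycode codec; keep others as-is."""
    decoded = []
    for label in host.split('.'):
        if label.lower().startswith('xn--'):
            try:
                label = label[4:].encode('ascii').decode('punycode')
            except UnicodeError:
                pass  # leave labels that do not decode unchanged
        decoded.append(label)
    return '.'.join(decoded)

def has_mixed_script_label(host):
    """True if any single hostname label mixes two or more scripts."""
    labels = decode_punycode_labels(host).split('.')
    return any(len(scripts_used(label)) > 1 for label in labels)

spoof = 'p\u0430ypal.com'  # second letter is Cyrillic 'a'
print(has_mixed_script_label(spoof))         # -> True
print(has_mixed_script_label('paypal.com'))  # -> False
```

This sketch does not catch labels written entirely in one lookalike script, so it complements the character list rather than replacing it.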

Step 4: Generate a Detection Heuristic

def is_phishing_url(url):
    domain, subdomain, tld = extract_domain_info(url)
    if has_suspicious_subdomains(subdomain):
        return True
    if contains_homograph(url):
        return True
    # Flag URLs that are not served over HTTPS
    if not url.lower().startswith('https://'):
        return True
    # Excessive URL length
    if len(url) > 100:
        return True
    # Look for percent-encoded bytes inside a query parameter value
    if re.search(r'[?&][^=&#]*=[^&#]*%[0-9A-Fa-f]{2}', url):
        return True
    return False
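
Because the function above returns on the first rule that fires, it cannot say why a URL was flagged. The sketch below shows one way to keep that information by running named rules and collecting the ones that match. It is standard-library only; the rule names, the RULES table, and phishing_reasons are assumptions for illustration, and only three of the checks are reproduced.

```python
import re
from urllib.parse import urlsplit

# Each rule is a (name, predicate) pair; the names are illustrative
RULES = [
    ('no_https', lambda url: urlsplit(url).scheme != 'https'),
    ('long_url', lambda url: len(url) > 100),
    ('encoded_query',
     lambda url: re.search(r'%[0-9A-Fa-f]{2}', urlsplit(url).query) is not None),
]

def phishing_reasons(url, rules=RULES):
    """Return the names of every rule that fires for this URL."""
    return [name for name, check in rules if check(url)]

print(phishing_reasons('http://legit-site.com/login?session=%3A%3Aexpire%3A%3A'))
# -> ['no_https', 'encoded_query']
print(phishing_reasons('https://secure.paypal.com'))
# -> []
```

A list of reasons also makes it easier to tune or weight individual rules later, since each false positive can be traced to the rule that caused it.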

Practical Usage

Let's test our heuristic with some examples:

test_urls = [
    'http://login.bank.com.evil.com',
    'https://secure.paypal.com',
    'http://xn--pypi-7qa.com',  # Punycode (IDN) hostname, a common homograph vehicle
    'http://verylongsubdomainname.example.com',
    'http://update.ҳѵ.com',  # Non-Latin (Cyrillic) characters in the domain
    'http://legit-site.com/login?session=%3A%3Aexpire%3A%3A',
]

for url in test_urls:
    result = is_phishing_url(url)
    print(f"URL: {url} -> Phishing: {result}")

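These heuristics are a starting point for pattern-based phishing detection. Note that every http:// URL in the list is flagged by the HTTPS check regardless of its other traits; of the samples above, only https://secure.paypal.com passes all checks. Combining these rules with machine learning models can further improve accuracy.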
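
If the rules are later combined with a machine learning model, the same traits can be expressed as numeric features instead of yes/no answers. The following is a minimal sketch of that idea using only the standard library; the function name and the choice of features are assumptions, and no model training is shown.

```python
import re
from urllib.parse import urlsplit

def url_features(url):
    """Turn a URL into a small dictionary of numeric features."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    return {
        'length': len(url),
        'host_dots': host.count('.'),
        'digits': sum(ch.isdigit() for ch in url),
        'percent_escapes': len(re.findall(r'%[0-9A-Fa-f]{2}', url)),
        'non_ascii': sum(ord(ch) > 127 for ch in url),
        'uses_https': int(parts.scheme == 'https'),
    }

features = url_features('http://login.bank.com.evil.com/a?x=%3A')
print(features['host_dots'], features['percent_escapes'], features['uses_https'])
# -> 4 1 0
```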

Conclusion

Developing a Python-based pattern detection system for phishing URLs requires understanding common obfuscation tactics. By incorporating domain analysis, subdomain scrutiny, Unicode checks, and heuristic rules, security professionals can bolster their defensive toolkit. Continuous updates and integration with real-time detection systems are essential for keeping pace with evolving threats.

Note: Always complement heuristic detection with comprehensive security measures, including user education and multi-layered defenses.


This code and methodology serve as a starting point for building more robust phishing detection tools.

