Mohammad Waseem

Posted on Feb 2

Detecting Phishing Patterns with Python: A Zero-Budget Approach

#security #python #phishing

Cybersecurity threats continue to evolve, with phishing being one of the most prevalent vectors for data breaches and credential theft. While commercial solutions exist, a security researcher on a limited or zero budget can still build effective detection mechanisms leveraging open-source tools and intelligent pattern analysis. This article explores how to utilize Python to identify common phishing patterns, focusing on free and accessible resources.

Understanding Phishing Patterns

Phishing emails often follow certain recognizable patterns, such as suspicious URLs, deceptive domain names, mismatched link text, and unusual email metadata. Detecting these patterns requires analyzing email components and URLs, checking for anomalies, and leveraging heuristics.
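To make "email components" concrete, here is a minimal sketch using only the standard library `email` package. The raw message, the helper names `address_domain` and `reply_to_mismatch`, and the choice of a From/Reply-To mismatch as the metadata check are all assumptions made for illustration, not part of any fixed method.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical raw message, written for illustration only
RAW_EMAIL = """\
From: "Example Bank" <support@example-bank.com>
Reply-To: attacker@mail-example.xyz
Subject: Urgent: verify your account
Content-Type: text/plain

Please verify your account at http://secure-login-example.com/update
"""

def address_domain(header_value):
    """Return the lowercased domain of an address header, or '' if absent."""
    _, address = parseaddr(header_value or "")
    return address.rpartition("@")[2].lower()

def reply_to_mismatch(raw_email):
    """True when Reply-To is present and points at a different domain than From."""
    msg = message_from_string(raw_email)
    from_domain = address_domain(msg["From"])
    reply_domain = address_domain(msg["Reply-To"])
    return bool(reply_domain) and reply_domain != from_domain

msg = message_from_string(RAW_EMAIL)
print("Subject:", msg["Subject"])
print("Body:", msg.get_payload().strip())
print("Reply-To mismatch:", reply_to_mismatch(RAW_EMAIL))
```

A differing Reply-To domain is only a weak signal, since legitimate mailing services also use one, so it is best treated as one indicator among several.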

Step 1: Collecting Data

Since this is a zero-budget solution, we will utilize publicly available email datasets or craft sample emails mimicking phishing characteristics. For demonstration, consider a collection of email snippets or URLs that may contain signs of phishing.

Step 2: Extracting URLs and Email Features

Using Python, extract URLs from email text and analyze their structure. The re module can extract URLs, and tldextract (which is free to install and uses public suffix lists) helps parse domain components.

import re
import tldextract

def extract_urls(email_text):
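    # The hyphen goes last in the character class so it is a literal;
    # written as \.-/ it forms a range that excludes '-' and truncates hyphenated domains.
    url_pattern = r'https?://[\w./+-]+'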
    return re.findall(url_pattern, email_text)

sample_email = "Please verify your account at http://secure-login-example.com/update instead of our usual site."
urls = extract_urls(sample_email)
print('Extracted URLs:', urls)

for url in urls:
    parsed = tldextract.extract(url)
    print(f"Domain: {parsed.domain}, Suffix: {parsed.suffix}")

This script extracts URLs and parses their domain components, which are crucial in detecting suspicious domains.
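Plain-text extraction does not see the visible text of links in HTML emails, which is where the link-text mismatch shows up. The following is a rough sketch, assuming the email body is available as an HTML string; `AnchorCollector`, `host_of`, and `mismatched_links` are hypothetical names, and the comparison is deliberately simple (it only fires when the visible text itself looks like a domain or URL).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class AnchorCollector(HTMLParser):
    """Collect (visible text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def host_of(value):
    """Hostname of a URL-like string without a leading 'www.', or '' if none."""
    if "://" not in value:
        value = "http://" + value
    host = urlparse(value).hostname or ""
    return host[4:] if host.startswith("www.") else host

def mismatched_links(html_body):
    """Return anchors whose visible text names a different host than the href."""
    collector = AnchorCollector()
    collector.feed(html_body)
    mismatches = []
    for text, href in collector.links:
        # Only compare when the visible text itself looks like a domain or URL
        if "." not in text or " " in text:
            continue
        text_host, href_host = host_of(text), host_of(href)
        if text_host and href_host and text_host != href_host:
            mismatches.append((text, href))
    return mismatches

sample_html = (
    '<p>Log in at <a href="http://secure-login-example.com/update">'
    'www.example-bank.com</a> or read our '
    '<a href="https://example-bank.com/help">help page</a>.</p>'
)
print(mismatched_links(sample_html))
```

In this sample only the first anchor is reported, because its text names `example-bank.com` while the link points elsewhere; the second anchor is skipped since "help page" does not look like a domain.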

Step 3: Detecting Suspicious Domains

Many phishing sites deploy domains that mimic legitimate ones or use unusual TLDs. To identify these, compare extracted domains against a whitelist of reputable domains or detect anomalies such as:

Use of uncommon TLDs (.xyz, .top, etc.)
Newly registered domains (using public WHOIS data or domain age APIs)
Domains with unusual character patterns

Since there's no budget, you can rely on free lookup APIs or open datasets of known malicious domains, such as those listed under Resources.

# Example: Check if domain is in a suspicious list
suspicious_domains = ['secure-login-example.com', 'verify-account.co']
for url in urls:
    parsed = tldextract.extract(url)  # parse once and reuse the result
    domain = f"{parsed.domain}.{parsed.suffix}"
    if domain in suspicious_domains:
        print(f"Suspicious domain detected: {domain}")
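Two of the anomalies listed above, uncommon TLDs and unusual character patterns, can be checked without any external data. The sketch below uses only the standard library; the TLD set, the thresholds, and the function name `domain_anomalies` are illustrative assumptions that would need tuning against real data. It treats the last hostname label as the TLD, which is a simplification compared with the suffix that tldextract returns.

```python
from urllib.parse import urlparse

# Illustrative values only; tune them against your own data
UNCOMMON_TLDS = {"xyz", "top"}
MAX_HYPHENS = 2
MAX_DIGITS = 4

def domain_anomalies(url):
    """Return a list of human-readable reasons the URL's host looks unusual."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    reasons = []
    # Simplification: the last label stands in for the TLD
    # (multi-part suffixes such as co.uk are not handled here)
    if labels[-1] in UNCOMMON_TLDS:
        reasons.append(f"uncommon TLD .{labels[-1]}")
    # Labels starting with xn-- are punycode, used for internationalized names
    if any(label.startswith("xn--") for label in labels):
        reasons.append("punycode (internationalized) label")
    if host.count("-") > MAX_HYPHENS:
        reasons.append("many hyphens")
    if sum(ch.isdigit() for ch in host) > MAX_DIGITS:
        reasons.append("many digits")
    return reasons

for candidate in [
    "http://account-verify-secure-login.xyz/reset",
    "https://www.python.org/",
]:
    print(candidate, "->", domain_anomalies(candidate))
```

Note that a raw IP address host will also trip the digit check, so it overlaps with any separate IP-address heuristic.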

Step 4: Pattern Recognition and Heuristics

Combine features to flag phishing emails:

URL domain similarity to known brands
URL path length
Presence of IP addresses instead of domain names
Mismatch between link text and URL

Here's an example heuristic: flag URLs with IP addresses as suspicious.

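import ipaddress
from urllib.parse import urlparse

def is_ip_address(url):
    # Inspect only the host, so dotted numbers in the path
    # (for example /v1.2.3.4/) are not mistaken for an IP address
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False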

for url in urls:
    if is_ip_address(url):
        print(f"Possible phishing URL (IP address): {url}")
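The first feature in the list, similarity to known brands, can be approximated with the standard library's `difflib`. This is a minimal sketch under stated assumptions: the brand list, the 0.8 threshold, and the function name `closest_brand` are hypothetical, and `SequenceMatcher.ratio()` is used here as a convenient stand-in for a proper edit-distance or homoglyph check.

```python
from difflib import SequenceMatcher

# Hypothetical brand list; replace with the domains you actually care about
KNOWN_BRANDS = ["paypal.com", "google.com", "microsoft.com"]

def closest_brand(domain, threshold=0.8):
    """Return (brand, ratio) when domain is a near miss of a known brand, else None."""
    domain = domain.lower()
    if domain in KNOWN_BRANDS:
        return None  # exact match: the real brand, not a lookalike
    best_brand, best_ratio = None, 0.0
    for brand in KNOWN_BRANDS:
        ratio = SequenceMatcher(None, domain, brand).ratio()
        if ratio > best_ratio:
            best_brand, best_ratio = brand, ratio
    if best_ratio >= threshold:
        return best_brand, best_ratio
    return None

for candidate in ["paypa1.com", "paypal.com", "example.org"]:
    print(candidate, "->", closest_brand(candidate))
```

The threshold trades false positives against missed lookalikes; a lower value catches more creative misspellings but also flags unrelated short domains.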

Step 5: Building a Simple Detector

By combining these heuristics, you can build a basic phishing detector program. For example:

def detect_phishing(email_text):
    urls = extract_urls(email_text)
    for url in urls:
        domain_obj = tldextract.extract(url)
        domain = domain_obj.domain + '.' + domain_obj.suffix
        # Flag listed domains and raw IP address hosts
        if domain in suspicious_domains or is_ip_address(url):
            return True
    return False

# Example usage
email = "Update your account at http://192.168.1.1/login now."
if detect_phishing(email):
    print('Potential phishing detected')
else:
    print('No phishing patterns detected')
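A detector that returns on the first hit cannot tell you which rule fired. One possible variation, sketched here with the standard library only, collects every triggered rule and sums a weight per rule. The weights, the threshold, the path-length limit, and the names `score_email` and `host_is_ip` are assumptions for illustration, not tuned values. It also compares the full hostname against the list, which is simpler than the registered-domain comparison used with tldextract.

```python
import ipaddress
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[\w./+-]+")
# Illustrative blocklist and weights; both are assumptions, not tuned values
SUSPICIOUS_DOMAINS = {"secure-login-example.com", "verify-account.co"}
WEIGHTS = {"listed domain": 3, "IP address host": 2, "long path": 1}
MAX_PATH_LENGTH = 40

def host_is_ip(host):
    """True when host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def score_email(email_text, threshold=2):
    """Return (is_suspicious, total_score, reasons) for the URLs in the text."""
    reasons = []
    for url in URL_PATTERN.findall(email_text):
        url = url.rstrip(".")  # drop a sentence-ending period
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if host in SUSPICIOUS_DOMAINS:
            reasons.append(("listed domain", url))
        if host_is_ip(host):
            reasons.append(("IP address host", url))
        if len(parsed.path) > MAX_PATH_LENGTH:
            reasons.append(("long path", url))
    total = sum(WEIGHTS[name] for name, _ in reasons)
    return total >= threshold, total, reasons

flagged, total, reasons = score_email(
    "Update your account at http://192.168.1.1/login now."
)
print(flagged, total, reasons)
```

Keeping the reasons alongside the score makes it easier to review false positives and to adjust individual weights later.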

Conclusion

A zero-budget approach to detecting phishing relies on pattern recognition, heuristic rules, and open-source data. While not as comprehensive as commercial solutions, this methodology provides a practical starting point for researchers and security teams to identify suspicious emails and links. By continuously updating domain lists, refining heuristics, and incorporating machine learning models (using free datasets), the detection capability can be incrementally improved over time.

Remember: Always validate findings with multiple indicators and stay updated on evolving phishing tactics.

Resources:

re, ipaddress, and tldextract (Python modules)
Public domain and WHOIS data APIs
Open source datasets like PhishTank or MalwareDomainList

🛠️ QA Tip

To test this safely without using real user data, I use TempoMail USA.

DEV Community
