Maxim Alex

Posted on Dec 30, 2025

Building a CMS-Level Firewall: Why Application Context Matters

#webdev #security #productivity #architecture

I develop websites on ProcessWire CMS, and I use Plausible Analytics for privacy-focused traffic monitoring. One day I noticed something odd: huge traffic spikes from Singapore. Hundreds of "visitors" daily, but zero actual engagement. Fake browsers hitting product pages, crawling everything, driving up server load.

My first thought was simple: "wouldn't it be cool to just block Singapore completely?" Like China's Great Firewall, but for my website. The primary goal was stopping scrapers from collecting data—price monitoring bots, content thieves, aggressive crawlers.

That's how WireWall started. I got so enthusiastic about blocking traffic that I accidentally blocked Google. Got a notification from Search Console on December 17, 2025:

"Blocked due to access forbidden (403)"

Search Console has identified that some pages on your site are not being indexed due to the following new reason: Blocked due to access forbidden (403)

Oops. My geographic blocking rules were hitting Googlebot because Google crawls from multiple countries, including some I had blocked. This taught me a critical lesson: you need whitelists, not just blocklists.

And IP-based whitelists alone were not enough. I needed ASN whitelists (entire networks belonging to legitimate services), User-Agent whitelists (verified crawlers), and a priority system in which whitelists always win over geographic blocks.

What started as "block Singapore" evolved into a full CMS-level firewall with a priority cascade running from 0 to 10, fake browser detection, VPN checks, and application context awareness. After 2 months in production across multiple e-commerce and content sites, it has analyzed over 169,000 requests and blocked 84,601 of them (49.93%), with no false positives on legitimate users that I know of.

The Problem with Traditional Firewalls

Server-level firewalls (ModSecurity, fail2ban) work at the HTTP layer. They see requests:

GET /product/expensive-wine HTTP/1.1
User-Agent: Mozilla/5.0...

They can block SQL injection patterns, detect suspicious user agents, rate-limit IPs. But they can't answer: "Is this user actually logged in?" or "Should this account have admin access?"

CDN firewalls (Cloudflare WAF, Sucuri) add bot detection, ML scoring, DDoS protection. Better, but still blind to who's behind the request. They see a valid session cookie but can't verify it's legitimate. They see an admin action but can't check if that user is actually an admin.

CMS-level firewalls (Wordfence for WordPress, RSFirewall for Joomla) execute inside your application. They see everything: user roles, session validity, plugin context, whether a workflow step was skipped.

The unique WireWall advantage: Because WireWall runs inside ProcessWire, it can whitelist AJAX requests from trusted ProcessWire modules (Priority 0.5). This prevents conflicts with legitimate admin operations—something external firewalls can't do because they don't understand what ProcessWire modules are.

Why CMS-Level Detection Works

Here's a real attack that bypassed my CDN but got caught immediately at the application level:

POST /admin/create-user HTTP/1.1
Cookie: session=abc123...

From the network perspective: valid HTTP, proper headers, legitimate session cookie, no SQL injection. Cloudflare let it through.

From the application perspective: this IP just logged in as a regular user 10 seconds ago, now suddenly has admin cookies? The session token doesn't match the claimed user ID? Blocked.

Traditional firewalls can't make that connection because they process each request in isolation.
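
To make the idea concrete, here is a minimal sketch of that kind of context check. It assumes a hypothetical server-side session record that stores the user ID and roles issued at login; the function name and array keys are illustrative and are not WireWall's actual API.

```php
<?php
// Hypothetical sketch: compare what the request claims against what the
// server-side session actually recorded at login.
function isSessionConsistent(array $session, int $claimedUserId, bool $wantsAdmin): bool
{
    // The claimed user must be the user the session was issued to.
    if (($session['user_id'] ?? null) !== $claimedUserId) {
        return false;
    }
    // Admin actions require an admin role stored on the server side.
    if ($wantsAdmin && !in_array('admin', $session['roles'] ?? [], true)) {
        return false;
    }
    return true;
}

$session = ['user_id' => 42, 'roles' => ['customer']];
var_dump(isSessionConsistent($session, 42, false)); // bool(true)
var_dump(isSessionConsistent($session, 42, true));  // bool(false): no admin role
var_dump(isSessionConsistent($session, 7, false));  // bool(false): user ID mismatch
```

A network-level firewall never sees `$session`, which is why it cannot run a check like this.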

Architecture: Multi-Level Priority System

Most firewalls work on simple allow/block lists. WireWall uses a cascading priority system where each check has a specific order:

Priority 0:   Admin area (NEVER blocked - triple-layer protection)
Priority 0.5: Trusted ProcessWire module AJAX (prevents module conflicts)
Priority 1:   IP Whitelist (your office, trusted IPs)
Priority 1.5: Allowed bots/IPs/ASNs (Googlebot, Microsoft, etc.)
Priority 2:   Rate limiting (blocks request floods)
Priority 3:   IP Blacklist (known bad actors)
Priority 4:   JS Challenge (suspicious requests must solve challenge)
Priority 5:   VPN/Proxy/Tor detection
Priority 6:   Datacenter blocking (AWS, Google Cloud, etc.)
Priority 7:   ASN blocking (block entire networks)
Priority 8:   Global rules (bad bots like scrapy, curl, sqlmap)
Priority 9:   Country blocking (geographic restrictions)
Priority 9.5: City blocking (Singapore, Beijing, etc.)
Priority 9.6: Subdivision blocking (Pennsylvania, California, etc.)
Priority 10:  Country-specific rules (fine-grained per-country logic)

This is what prevents a repeat of my Google disaster: the exception system (Priority 1.5) is checked before geographic blocks (Priority 9), so even if I accidentally block the entire US, Googlebot's ASN whitelist (AS15169) overrides it.
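
A rough sketch of how such a cascade can be evaluated, assuming each rule is a callable that returns 'allow', 'block', or null (no opinion). The rule structure and the function name are my own illustration, not WireWall's internals.

```php
<?php
// Hypothetical sketch of a priority cascade: rules are sorted by priority and
// the first rule that returns a verdict ends the evaluation.
function evaluateCascade(array $rules, array $request): string
{
    // Sort ascending so lower priority numbers run first.
    usort($rules, fn(array $a, array $b) => $a['priority'] <=> $b['priority']);
    foreach ($rules as $rule) {
        $verdict = $rule['check']($request); // 'allow', 'block', or null
        if ($verdict !== null) {
            return $verdict;
        }
    }
    return 'allow'; // no rule had an opinion
}

$rules = [
    // Priority 9: country block (deliberately over-broad, includes the US)
    ['priority' => 9, 'check' => fn($r) => in_array($r['country'], ['US', 'SG'], true) ? 'block' : null],
    // Priority 1.5: ASN whitelist (Google, Microsoft)
    ['priority' => 1.5, 'check' => fn($r) => in_array($r['asn'], [15169, 8075], true) ? 'allow' : null],
];

echo evaluateCascade($rules, ['country' => 'US', 'asn' => 15169]), "\n"; // allow
echo evaluateCascade($rules, ['country' => 'SG', 'asn' => 64500]), "\n"; // block
```

Because the whitelist rule has the lower number, it answers first and the country rule is never consulted for Googlebot's network.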

Detection Methods

Geographic Control (The Original Problem)

Remember that Singapore traffic in Plausible? WireWall handles it with multiple levels:

// Country-level blocking
Block: Singapore (SG), Vietnam (VN), Bangladesh (BD)

// City-level blocking (more surgical)
Block: Singapore City, Hanoi, Dhaka

// But whitelist still wins
Whitelist ASN: 15169 (Google), 8075 (Microsoft)

MaxMind GeoLite2 gives me 0.5-2ms lookups with city/region data. Without MaxMind, it falls back to ip-api.com (100-500ms), but for production sites, local databases are essential. I update them weekly (MaxMind releases new data every Tuesday and Friday).

Result: Plausible Analytics now shows clean traffic data. Those 7,259 fake Singapore requests? Gone. Real visitors only.

Fake Browser Detection

Real browsers send 12-15 HTTP headers. Scripts using curl typically send 3-4:

# curl default
GET / HTTP/1.1
Host: example.com
User-Agent: curl/7.81.0
Accept: */*

# Real Chrome
GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0...
Accept: text/html,application/xhtml+xml...
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Sec-Ch-Ua: "Chrome";v="120"...
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none

WireWall checks multiple signals:

// Real browsers ALWAYS send these headers
$hasAcceptLanguage = !empty($_SERVER['HTTP_ACCEPT_LANGUAGE']);
$hasAcceptEncoding = !empty($_SERVER['HTTP_ACCEPT_ENCODING']);
$hasAccept = !empty($_SERVER['HTTP_ACCEPT']);

// Missing critical headers = FAKE
if (!$hasAcceptLanguage || !$hasAcceptEncoding || !$hasAccept) {
    return true; // Blocked
}

// Suspicious Accept header (wget/curl send only */*)
if ($_SERVER['HTTP_ACCEPT'] === '*/*') {
    return true; // Blocked
}

// Chrome 90+ should have Sec-CH-UA header ($chromeVersion: major version parsed from the User-Agent)
if ($chromeVersion >= 90 && empty($_SERVER['HTTP_SEC_CH_UA'])) {
    return true; // Likely automation
}

This catches automation that spoofs the User-Agent (or even the TLS fingerprint, as curl-impersonate does) but sends a set of HTTP headers that no real browser would send.
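
The snippet above relies on a Chrome version number. One possible way to obtain it is sketched below; both helper names are hypothetical, and the version threshold of 90 is taken from the check above.

```php
<?php
// Hypothetical helper: extract the major Chrome version from a User-Agent.
// Returns null when the UA does not claim to be Chrome.
function chromeMajorVersion(string $userAgent): ?int
{
    if (preg_match('~Chrome/(\d+)\.~', $userAgent, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// A UA claiming Chrome 90+ without a Sec-CH-UA header is treated as suspicious.
function claimsChromeWithoutClientHints(array $server): bool
{
    $version = chromeMajorVersion($server['HTTP_USER_AGENT'] ?? '');
    return $version !== null && $version >= 90 && empty($server['HTTP_SEC_CH_UA']);
}

$server = [
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        . '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];
var_dump(chromeMajorVersion($server['HTTP_USER_AGENT'])); // int(120)
var_dump(claimsChromeWithoutClientHints($server));        // bool(true)
```

One caveat: browsers send client hints only over HTTPS, so a check like this would probably misfire on a site served over plain HTTP.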

Command-Line Tool & Scanner Detection

Attackers use automated tools. WireWall blocks common patterns:

$badBotPatterns = [
    'scrapy', 'curl', 'wget', 'python-requests',
    'masscan', 'nmap', 'nikto', 'sqlmap',
    'dirbuster', 'acunetix', 'netsparker',
    'semrush', 'ahrefs', 'mj12bot', 'dotbot',
    'zgrab', 'go-http-client', 'nessus', 'metasploit'
];

But simple User-Agent matching isn't enough—attackers fake User-Agents. That's why the header analysis is critical:

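// Check for command-line tools
if (empty($_SERVER['HTTP_ACCEPT'])) {
    // No Accept header = curl/wget/script
    return 'blocked';
}

// The User-Agent header may be missing entirely, so default to an empty string
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

// stripos() matches case-insensitively ("Curl", "Python-Requests", ...)
if (stripos($userAgent, 'curl') !== false ||
    stripos($userAgent, 'python-requests') !== false) {
    return 'blocked';
}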

Even sophisticated attackers using curl-impersonate (which mimics Chrome's TLS fingerprint) trip up on missing browser APIs that JavaScript challenges can verify.

AI Bot Blocking (The New Frontier)

AI training bots became a major issue in 2025. WireWall blocks:

GPTBot (OpenAI)
ChatGPT-User
ClaudeBot (Anthropic)
Claude-Web
Google-Extended (Google AI training)
PerplexityBot
GrokBot (xAI)
Cohere-AI
ByteSpider (ByteDance/TikTok)
Meta-ExternalAgent (Meta AI)

Several of these scrape aggressively, and some ignore robots.txt (ByteSpider kept hitting my sites despite it). The interesting part: unlike malicious scrapers, they identify themselves in the User-Agent string, so detection is straightforward:

protected function getAIBotPatterns() {
    return [
        'gptbot', 'chatgpt', 'claudebot', 'claude-web',
        'anthropic-ai', 'google-extended', 'grokbot',
        'cohere-ai', 'perplexitybot', 'you-ai',
        'bytespider', 'meta-externalagent'
    ];
}

But you need application-level blocking because they use legitimate infrastructure (Google Cloud, AWS)—blocking their ASNs would break everything else running on those networks.
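
A pattern list like the ones above can be applied with a simple case-insensitive substring match. The following is a sketch; the function name is hypothetical and the short pattern list is only a subset used for illustration.

```php
<?php
// Hypothetical matcher: returns the first pattern found in the User-Agent
// (case-insensitive), or null when none match.
function matchUserAgentPattern(string $userAgent, array $patterns): ?string
{
    foreach ($patterns as $pattern) {
        if (stripos($userAgent, $pattern) !== false) {
            return $pattern;
        }
    }
    return null;
}

$aiBots = ['gptbot', 'claudebot', 'bytespider', 'perplexitybot'];
$ua = 'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)';
var_dump(matchUserAgentPattern($ua, $aiBots)); // string(6) "gptbot"
```

Returning the matched pattern rather than a plain boolean makes it easy to write the block reason into the log.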

JavaScript Challenge (Anti-Bot Layer)

For suspicious requests (headless browsers, missing headers), WireWall can show a JavaScript challenge:

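// Generates challenge token
// (md5() is not a browser built-in, so the page must provide one;
// salt is assumed to be embedded in the page by the server)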
const timestamp = Date.now();
const token = md5(timestamp + salt);
document.cookie = `ww_challenge=${token}:${timestamp}`;

Command-line tools (curl, wget, simple scrapers) can't execute JavaScript. Real browsers solve the challenge automatically and get a valid cookie (1-hour validity). This catches automation that passes the header checks but has no JavaScript engine behind it.

The challenge page is beautifully designed with wave patterns. It is not a captcha, just a pure JavaScript execution test.
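
On the server side, the cookie has to be verified on the next request. Below is a minimal sketch of that step, assuming the `token:timestamp` cookie format shown above, a millisecond timestamp (as produced by `Date.now()`), and a salt known to the server. The function name and parameters are hypothetical.

```php
<?php
// Hypothetical server-side check for a "token:timestamp" challenge cookie,
// where token = md5(timestamp . salt) and timestamp is in milliseconds.
function isChallengeCookieValid(string $cookie, string $salt, int $nowMs, int $ttlMs = 3600000): bool
{
    $parts = explode(':', $cookie);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false; // malformed cookie
    }
    [$token, $timestamp] = $parts;

    // Reject expired or future-dated timestamps (default TTL: 1 hour).
    $age = $nowMs - (int) $timestamp;
    if ($age < 0 || $age > $ttlMs) {
        return false;
    }

    // Recompute the token and compare in constant time.
    return hash_equals(md5($timestamp . $salt), $token);
}

$timestamp = '1700000000000';
$cookie = md5($timestamp . 'example-salt') . ':' . $timestamp;
var_dump(isChallengeCookieValid($cookie, 'example-salt', 1700000060000)); // bool(true)
```

Since the salt is visible in the page, this proves only that JavaScript ran. It is not a cryptographic guarantee.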

VPN/Proxy Detection with Triple Fallback

I use three APIs with automatic fallback (important for reliability):

// API 1: ip-api.com (free, no key needed, 45 req/min)
$response = $http->get("http://ip-api.com/json/{$ip}?fields=proxy,hosting");
$data = $response ? json_decode($response, true) : null; // decode the JSON body; null if the request failed
if (!empty($data['proxy']) || !empty($data['hosting'])) {
    return true; // VPN/Proxy detected
}

// API 2: ipinfo.io (fallback)
$response = $http->get("https://ipinfo.io/{$ip}/json");
// Check 'org' field for hosting/vpn/proxy keywords

// API 3: ipapi.co (fallback)
$response = $http->get("https://ipapi.co/{$ip}/json/");
// Check 'org' field for hosting/vpn/cloud keywords

Results are cached for 7 days. When a VPN is detected, I don't auto-block—I just flag it as suspicious and apply stricter checks. This prevents false positives (legitimate users on corporate VPNs, traveling users) while catching obvious abuse.
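
The fallback-plus-cache logic can be sketched independently of the actual HTTP calls. In this illustration each provider is a callable that returns true, false, or null (unknown), and the cache is a plain array. All of these names are hypothetical stand-ins, not WireWall's real code.

```php
<?php
// Hypothetical sketch: try each provider in order, skip the ones that fail,
// and cache the first definite answer for 7 days (604800 seconds).
function detectProxy(string $ip, array $providers, array &$cache, int $now, int $ttl = 604800): ?bool
{
    // Serve from cache while the entry is still fresh.
    if (isset($cache[$ip]) && $cache[$ip]['expires'] > $now) {
        return $cache[$ip]['proxy'];
    }
    foreach ($providers as $provider) {
        try {
            $result = $provider($ip); // true, false, or null when unknown
        } catch (Throwable $e) {
            $result = null; // provider down or rate-limited: fall through
        }
        if ($result !== null) {
            $cache[$ip] = ['proxy' => $result, 'expires' => $now + $ttl];
            return $result;
        }
    }
    return null; // every provider failed; the caller decides what to do
}

$cache = [];
$providers = [
    function (string $ip) { throw new RuntimeException('rate limited'); },
    function (string $ip) { return true; },
];
var_dump(detectProxy('203.0.113.5', $providers, $cache, time())); // bool(true)
```

Returning null when every provider fails keeps "unknown" distinct from "not a proxy", which fits the flag-but-don't-auto-block approach described above.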

Datacenter Blocking

Bots love running from AWS, Google Cloud, DigitalOcean. WireWall detects datacenters via ASN organization names:

$datacenterKeywords = [
    'amazon', 'aws', 'google', 'cloud', 'azure', 'microsoft',
    'digitalocean', 'ovh', 'hetzner', 'linode', 'vultr',
    'hosting', 'datacenter', 'cloudflare', 'akamai', 'fastly'
];

BUT—you need whitelists here. If you use Cloudflare or a CDN, you must whitelist their ASNs, or you'll block all traffic. This is where the priority system saves you: ASN whitelist (Priority 1.5) beats datacenter blocking (Priority 6).
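
As a sketch of how the whitelist and the keyword match interact, assume the ASN number and organization name are already known for the request (for example from the GeoLite2 ASN database). The function name and signature are hypothetical.

```php
<?php
// Hypothetical sketch: whitelisted ASNs are never treated as datacenters;
// everything else is matched against the keyword list (case-insensitive).
function isBlockedDatacenter(int $asn, string $orgName, array $keywords, array $allowedAsns): bool
{
    if (in_array($asn, $allowedAsns, true)) {
        return false; // whitelist wins
    }
    foreach ($keywords as $keyword) {
        if (stripos($orgName, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

$keywords = ['amazon', 'cloudflare', 'hosting'];
$allowedAsns = [13335]; // Cloudflare, whitelisted because the site sits behind it

var_dump(isBlockedDatacenter(16509, 'Amazon.com, Inc.', $keywords, $allowedAsns)); // bool(true)
var_dump(isBlockedDatacenter(13335, 'Cloudflare, Inc.', $keywords, $allowedAsns)); // bool(false)
```

Broad keywords such as 'cloud' or 'google' can match organization names you did not intend, so it is worth reviewing what a list like this actually catches.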

Geographic Blocking

Using the MaxMind GeoLite2 database (which I refresh weekly, as noted above):

$reader = new GeoIp2\Database\Reader('GeoLite2-Country.mmdb');
$country = $reader->country($ip)->country->isoCode;

if (in_array($country, $blockedCountries)) {
    // But check priority first!
    if ($priority < 5) {
        return 'allowed'; // Whitelist wins
    }
    return 'blocked';
}

The priority system means my whitelist overrides geographic blocks—important when I'm traveling.

Real-World Results

Two months in production across multiple sites (including an e-commerce wine store and a regional news portal):

Overall Statistics:

169,423 total requests analyzed
84,601 blocked (49.93% of traffic)
84,822 allowed (50.07% legitimate traffic)
Zero false positives on legitimate users
Server load reduced ~40% (blocked requests never reach app)

Block Reasons Breakdown:

VPN/Proxy/Tor:        57,938 blocks (34.20%)
Bad Bots (global):    24,717 blocks (14.59%)
Rate Limiting:         1,790 blocks ( 1.06%)
Datacenter IPs:          141 blocks ( 0.08%)
JS Challenge:             15 blocks ( 0.01%)
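
The percentages in this table are shares of all analyzed requests, not shares of blocked requests. A short sketch that recomputes both views from the figures above (the helper name is my own):

```php
<?php
// Share of $part in $whole as a percentage, rounded to two decimals.
function sharePercent(int $part, int $whole): float
{
    return round($part / $whole * 100, 2);
}

$totalRequests = 169423;
$blocksByReason = [
    'VPN/Proxy/Tor'  => 57938,
    'Bad Bots'       => 24717,
    'Rate Limiting'  => 1790,
    'Datacenter IPs' => 141,
    'JS Challenge'   => 15,
];
$totalBlocked = array_sum($blocksByReason); // 84601, matching the total above

foreach ($blocksByReason as $reason => $count) {
    printf(
        "%-15s %6d  %5.2f%% of all requests  %5.2f%% of blocks\n",
        $reason,
        $count,
        sharePercent($count, $totalRequests),
        sharePercent($count, $totalBlocked)
    );
}
```

Seen as a share of blocks only, VPN/Proxy/Tor accounts for roughly two thirds of everything that was stopped.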

Geographic Distribution:

Singapore: 18,434 requests (10.88% of all traffic!) - mostly ByteSpider and fake browsers
US: 132,831 requests (78.40%) - mix of legitimate and datacenter traffic
GB, HK, BR: Combined 7,000+ requests from various sources

Top Blocked Cities:

1. Ashburn, Virginia    21,882 blocks (AWS datacenters)
2. Singapore             7,259 blocks (the original problem!)
3. Los Angeles           3,102 blocks (proxy networks)
4. New York              3,076 blocks (VPN exits)
5. Buffalo, NY             998 blocks (datacenter hub)

Bot Activity Caught:

SemrushBot:              9,408 blocks (SEO crawler ignoring robots.txt)
ByteSpider (TikTok):     8,692 blocks (AI training scraper)
UptimeRobot:             3,019 blocks (monitoring from datacenters)
Apple Bot:                 945 blocks (undisclosed crawler)

Googlebot:               4,950 ALLOWED (legitimate SEO)
Microsoft Bot:           8,067 ALLOWED (Bing indexing)
Facebook Bot:            3,395 ALLOWED (social previews)

Top Attack Networks (ASNs):

AS16509 Amazon (17,146 blocks) - hosting scrapers
AS209366 SEMrush (9,408 blocks) - aggressive crawler
AS36352 ColoCrossing (4,174 blocks) - cheap VPS provider
AS203020 HostRoyale (3,412 blocks) - proxy service

Most interesting: 80% of blocked traffic had perfect HTTP syntax and valid user agents. They failed application-level checks—missing browser headers, wrong TLS fingerprints, datacenter IPs, suspicious behavioral patterns that only CMS-level detection catches.

Lessons Learned

1. Layered Security is Essential

WireWall doesn't replace Cloudflare or ModSecurity. Each layer catches different attacks:

CDN: DDoS, volumetric attacks, known signatures
Server firewall: IP rate limiting, basic pattern matching
CMS firewall: Authentication bypass, privilege escalation, business logic attacks

2. Admin Protection Must Be Bulletproof

WireWall uses triple-layer admin protection—checks URL path, page template, and root parent. The admin area is Priority 0 (checked FIRST) and never blocked, even if your IP is blacklisted. This saved me multiple times during testing.

// Check URL path
if (strpos($requestUri, $adminPath) === 0) return; // Allow

// Check page template
if ($page->template == 'admin') return; // Allow

// Check root parent (for nested admin pages)
if ($page->rootParent->template == 'admin') return; // Allow

3. Context Beats Signatures

Signature-based detection (looking for UNION SELECT) gets bypassed. Application context (this user shouldn't access admin functions) doesn't.

4. Logging is Critical

In production, attacks manifest in unpredictable ways. Comprehensive logging revealed patterns I never anticipated:

[WireWall] Blocked: IP 1.2.3.4, Reason: Fake browser (missing Accept header)
[WireWall] Blocked: IP 5.6.7.8, Reason: Datacenter (AWS AS16509)
[WireWall] Blocked: IP 9.10.11.12, Reason: City blocked (Singapore)

These logs helped tune detection thresholds and discover new attack vectors. After analyzing 169,423 production requests over 2 months, clear patterns emerged:

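34% of all requests (about 68% of blocks) came from VPN/proxy/Tor networks attempting to hide origin
15% of all requests (about 29% of blocks) were bad bots caught by the global rules; blocks attributed purely to datacenter IPs were just 0.08%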
Ashburn, Virginia alone generated 21,882 blocked requests (AWS hosting)
ByteSpider (TikTok's AI trainer) hit sites 8,692 times despite robots.txt

Without detailed logging, these patterns would be invisible. Each log entry includes city, region, ASN, and block reason—essential for understanding attack geography and adjusting rules.
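
As an illustration of mining such logs, the sketch below counts block reasons from lines in the format shown above. The function name is hypothetical, and the regular expression assumes that exact `[WireWall] Blocked: IP ..., Reason: ...` layout.

```php
<?php
// Hypothetical sketch: count block reasons from log lines. Anything in
// parentheses after the reason (the detail) is ignored for grouping.
function countBlockReasons(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        if (preg_match('/^\[WireWall\] Blocked: IP (\S+), Reason: ([^(]+)/', $line, $m) === 1) {
            $reason = trim($m[2]);
            $counts[$reason] = ($counts[$reason] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent reason first
    return $counts;
}

$lines = [
    '[WireWall] Blocked: IP 1.2.3.4, Reason: Fake browser (missing Accept header)',
    '[WireWall] Blocked: IP 5.6.7.8, Reason: Datacenter (AWS AS16509)',
    '[WireWall] Blocked: IP 9.10.11.12, Reason: City blocked (Singapore)',
];
print_r(countBlockReasons($lines));
```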

5. Exception System Prevents False Positives

Three-tier exceptions (IP whitelist, ASN whitelist, User-Agent whitelist) prevent legitimate traffic from being blocked. ASN whitelisting is particularly powerful—whitelist AS15169 once, and all Google services (Googlebot, Google Cloud, etc.) bypass all checks.

Technical Stack

WireWall 1.2.0 - Latest stable release (December 2025)
ProcessWire CMS - hooks into request initialization before routing
PHP 8.1+ - typed properties, match expressions for clean code
MaxMind GeoLite2 - Three databases: Country (~5MB), ASN (~7MB), City (~70MB)
Composer - GeoIP2 library for MaxMind integration
Multi-API fallback - ip-api.com, ipinfo.io, ipapi.co for VPN/proxy detection
File-based cache - scales to 1M+ IPs without database overhead (0.1ms cache hits, 7-30 day TTL)
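
For readers curious what a file-based cache with a TTL can look like, here is a minimal sketch: one JSON file per IP, named by a hash of the address. The function names and the file layout are assumptions for illustration, not WireWall's actual cache format.

```php
<?php
// Hypothetical sketch of a per-IP file cache with expiry.
function wwCachePath(string $dir, string $ip): string
{
    return $dir . '/' . md5($ip) . '.json';
}

function wwCacheSet(string $dir, string $ip, array $value, int $ttl, int $now): void
{
    $entry = ['expires' => $now + $ttl, 'value' => $value];
    file_put_contents(wwCachePath($dir, $ip), json_encode($entry), LOCK_EX);
}

function wwCacheGet(string $dir, string $ip, int $now): ?array
{
    $path = wwCachePath($dir, $ip);
    if (!is_file($path)) {
        return null; // cache miss
    }
    $entry = json_decode((string) file_get_contents($path), true);
    if (!is_array($entry) || ($entry['expires'] ?? 0) <= $now) {
        @unlink($path); // expired or corrupt: drop the file
        return null;
    }
    return $entry['value'];
}

$dir = sys_get_temp_dir();
wwCacheSet($dir, '203.0.113.5', ['proxy' => true], 7 * 86400, time());
var_dump(wwCacheGet($dir, '203.0.113.5', time())); // the cached array
```

With very many cached IPs, a single flat directory can get slow on some filesystems. Splitting files into subdirectories by hash prefix is a common workaround.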

Why This Matters

That fake Singapore traffic in Plausible? 7,259 blocked requests from one city alone. It wasn't just analytics pollution—it was wasted server resources, skewed business decisions, and potential attack reconnaissance.

Most developers think: "I have Cloudflare, I'm protected." But application-layer attacks bypass network firewalls entirely. You need defense at every layer:

CDN/WAF: Stops volumetric attacks, obvious malicious patterns
Server firewall: Adds rate limiting, basic IP filtering
CMS firewall: Catches what others can't—authentication bypass, privilege escalation, business logic attacks

CMS-level firewalls see what others can't: user identity, session validity, authorization context, plugin state. That's where sophisticated attacks happen—the ones that look like legitimate traffic until you understand what the application is actually doing.

The numbers don't lie: out of 169,423 requests analyzed, 84,601 were blocked (49.93%). Nearly half of all traffic. These had valid HTTP syntax, realistic user agents, and proper SSL/TLS, but failed application-level checks. A firewall that only inspects the protocol would have let them through, because they looked legitimate at that level.

The 34% blocked for VPN/Proxy, 14.6% for bad bots, the thousands of requests from datacenter IPs—all invisible to traditional firewalls until you check what the application knows about each request.

WireWall is open source and production-ready for ProcessWire CMS:

GitHub: github.com/mxmsmnv/WireWall
Website: wirewall.org

The module is actively maintained and battle-tested on production sites handling thousands of daily visitors. Installation takes 5 minutes, MaxMind setup adds another 10. Currently deployed across e-commerce platforms and content sites with 169,000+ requests analyzed.
