A defaced website is a curious problem.
It's loud — anyone visiting the page can see something is wrong. But it's also quiet from a server's perspective: HTTP returns 200, your uptime monitor is happy, your TLS cert hasn't moved, and the CMS logs show a "successful" content update from a legitimate-looking session. The signal is on the rendered page, not in the metrics.
I run a site at hi3ris.blueshield.tg and keep an eye on a couple dozen others for various reasons. After my third "you've been hacked, by the way" message from a friend, I got tired of trusting external uptime services that don't know what my homepage is supposed to look like. So I built WatchTower — an async-first defacement monitor that combines four detection layers, captures evidence, and alerts on multiple channels.
This post is a tour of how it actually works under the hood. We'll go through the four detection layers, the async crawler, the way the PyQt6 UI is decoupled from scanning, and what's coming next (spoiler: a local model to replace Gemini).
The full source is on github.com/hi3ris — Python 3.10+, MIT licensed.
## Why four detection layers, not one?
The intuitive approach to "did the page change?" is to hash the HTML and compare. Done in a couple of lines.
```python
import hashlib

sha = hashlib.sha256(html.encode("utf-8", "ignore")).hexdigest()
```
The problem: legitimate sites change all the time. A timestamp in the footer. An ad rotator. A CSRF token in a form. A blog adding a new article. A pure SHA-256 comparison flags all of those as "changed", and you end up either drowning in false positives or whitelisting so aggressively that real defacements slip through.
A real defacement detector needs to answer a more nuanced question: "did the page change in a way that matters?" That question can't be answered by one signal alone. WatchTower stacks four:
- SHA-256 of normalized text — fastest, catches anything bit-exact
- Perceptual hash (pHash) of the rendered screenshot — catches visual changes, robust to text noise
- TF-IDF cosine similarity between old and new text — catches semantic shifts
- AI escalation (currently Gemini, soon a local model) — last-resort visual analysis on suspicious cases
Any one layer flagging is a yellow signal; multiple layers agreeing is when an alert fires. Let's look at each.
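The combination rule itself is simple voting. A minimal sketch of the idea — the class and function names here are illustrative, not WatchTower's actual API:

```python
from dataclasses import dataclass

@dataclass
class LayerVerdicts:
    # One boolean per detection layer; True means "this layer flagged".
    sha_changed: bool
    phash_suspicious: bool
    text_dissimilar: bool
    ai_flagged: bool = False

def should_alert(v: LayerVerdicts) -> bool:
    """Alert only when at least two layers agree.

    A single flag is a yellow signal (logged, not alerted);
    the AI layer counts like any other vote here.
    """
    votes = sum([v.sha_changed, v.phash_suspicious, v.text_dissimilar, v.ai_flagged])
    return votes >= 2
```

The cheap-to-expensive ordering also means later layers only run when earlier ones flag, so the vote is usually decided after one or two comparisons.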
## Layer 1 — SHA-256: the cheap fast pass
The first layer is just a content fingerprint. It runs on every scan, costs essentially nothing, and tells you whether anything changed at all. If the SHA matches the previous scan, you can skip the rest of the pipeline.
```python
def calculate_sha256(self, text_content: str) -> str:
    return hashlib.sha256(text_content.encode("utf-8", "ignore")).hexdigest()
```
The trick is what you hash. Hashing raw HTML triggers on every dynamic element. WatchTower normalizes first — strips scripts, comments, and a few well-known dynamic attributes — then hashes the visible text only. That keeps SHA-256 useful as a "did the visible content change?" filter.
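For illustration, here is one way to implement that normalize-then-hash step using only the standard library. WatchTower's own normalizer also strips known dynamic attributes, so treat this as a sketch of the idea rather than its implementation:

```python
import hashlib
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects text outside <script>/<style>/<noscript>.

    HTML comments are dropped automatically: HTMLParser routes them to
    handle_comment, which we leave as a no-op.
    """
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def normalized_text_hash(html: str) -> str:
    """Hash only the visible text, so dynamic scripts don't change the hash."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)  # collapse whitespace between text nodes
    return hashlib.sha256(text.encode("utf-8", "ignore")).hexdigest()
```

With this, two pages that differ only in a script body or a comment produce the same fingerprint, while any visible-text change produces a new one.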
## Layer 2 — Perceptual hashing: the visual eye
When the SHA-256 changes, the next question is how visually different the page is. That's a job for imagehash.phash over the rendered screenshot.
```python
from functools import lru_cache

import imagehash
from PIL import Image

@lru_cache(maxsize=1000)
def calculate_phash(self, image_path: str) -> str:
    with Image.open(image_path) as img:
        return str(imagehash.phash(img))

def phash_distance(self, h1: str, h2: str) -> int:
    return imagehash.hex_to_hash(h1) - imagehash.hex_to_hash(h2)
```
pHash gives you a 64-bit hash where Hamming distance correlates with perceptual similarity. WatchTower's default threshold is distance > 10 (≈ 15% of bits flipped) before the page is flagged as visually changed. That tolerance is configurable — phash_tolerance_percent in the config — because some homepages legitimately rotate hero images.
The lru_cache(maxsize=1000) matters: when you're scanning 50 sites every 30 seconds, recomputing pHashes for unchanged screenshots burns CPU for nothing.
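Putting the threshold together: the configured percentage has to be converted into a bit count on the 64-bit hash. A small helper sketch — the function name is mine, not WatchTower's:

```python
HASH_BITS = 64  # imagehash.phash default: 8x8 DCT coefficients -> 64-bit hash

def is_visually_changed(distance: int, tolerance_percent: float = 15.0) -> bool:
    """Flag when the Hamming distance exceeds the configured fraction of bits."""
    threshold_bits = round(HASH_BITS * tolerance_percent / 100)  # 15% -> 10 bits
    return distance > threshold_bits
```

At the default 15%, a distance of 10 bits is still tolerated and 11 fires; sites with rotating hero images can raise the percentage without touching the rest of the pipeline.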
## Layer 3 — TF-IDF: the semantic check
Visual hashing fails on text-only defacements: somebody rewriting your homepage with the exact same layout, but different words. For that we need a content similarity score.
WatchTower uses scikit-learn's TfidfVectorizer to vectorize old and new text, then cosine similarity to compare:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

self.vectorizer = TfidfVectorizer(
    stop_words=self.french_stop_words,
    max_features=5000,
    dtype=np.float32,
)

def text_similarity(self, old_text: str, new_text: str) -> float:
    matrix = self.vectorizer.fit_transform([old_text, new_text])
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0])
The threshold defaults to 0.80 — below that, the text content has shifted enough to be suspicious. Stop words live in an external file (assets/french_stop_words.txt), so the language can be swapped without touching code.
What this layer catches: a defacer overwriting your "About us" page with a manifesto. SHA-256 fires, pHash might or might not (same template, just different text), but cosine similarity drops to ~0.2 instantly.
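As a standalone sketch, the whole layer fits in a few lines. This version fits a fresh vectorizer on just the two documents, uses made-up sample text, and skips the stop-word list, so it's an adaptation for illustration, not WatchTower's method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(old_text: str, new_text: str) -> float:
    # Fit on only the two documents being compared, then take the cosine
    # of their TF-IDF vectors: 1.0 = identical vocabulary, 0.0 = disjoint.
    vectorizer = TfidfVectorizer(max_features=5000)
    matrix = vectorizer.fit_transform([old_text, new_text])
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0])
```

Run it on an "About us" paragraph versus a defacement manifesto and the score collapses toward zero, well below the 0.80 alert threshold.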
## Layer 4 — AI escalation: the judge of last resort
When the first three layers disagree, you have a borderline case: SHA changed, pHash is suspicious, semantic similarity is in a gray zone. WatchTower escalates these cases — and only these — to a vision-capable LLM with the screenshot and a focused prompt:
> "Compare the previous screenshot with the current one. Has this page been defaced? Look for: hacker tags, foreign-language banners, signature payloads ("hacked by ...", "your security is an illusion"), broken layouts, or political/ideological overlays. Reply with a confidence score and a one-line justification."
Today this hits Gemini (gemini-1.5-flash-latest) with retries (2s, 8s, 32s exponential backoff) and a kill-switch — if 5 consecutive calls fail, the API is disabled for the session and the system falls back to layers 1–3 alone. Graceful degradation matters here: the monitor should never go offline because Google had a bad afternoon.
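The retry-with-kill-switch pattern is worth showing on its own. This is a generic sketch: `call_api` stands in for the real Gemini request, and the class name and structure are illustrative:

```python
import time

class KillSwitchClient:
    """Wrap a flaky API call with exponential backoff and a session kill-switch."""

    def __init__(self, call_api, backoffs=(2, 8, 32), max_consecutive_failures=5):
        self.call_api = call_api          # stand-in for the real API request
        self.backoffs = backoffs          # seconds to wait before each retry
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.disabled = False

    def query(self, payload):
        if self.disabled:
            return None  # kill-switch tripped: caller falls back to layers 1-3
        for delay in (0,) + tuple(self.backoffs):
            if delay:
                time.sleep(delay)  # exponential backoff between attempts
            try:
                result = self.call_api(payload)
                self.consecutive_failures = 0  # any success resets the counter
                return result
            except Exception:
                continue
        # all attempts for this query failed
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.disabled = True  # stop calling the API for the rest of the session
        return None
```

A `None` return means "no AI verdict", which the caller treats as "decide from the first three layers alone" rather than as an error.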
This is the layer I'm replacing. Sending screenshots to a third-party API for every borderline case is a privacy and cost problem at scale, and the prompt has to be language-agnostic, which is genuinely hard. I'm training a local CNN+text classifier on a corpus of confirmed defacements (zone-h archive + my own captures) and benchmarking it head-to-head with the Gemini path. When recall plateaus around the same level, Gemini gets demoted to a fallback.
## The async crawler: how it stays fast
Multi-layer detection is meaningless if you can't scan often enough. The first version of WatchTower was synchronous — and scanning 100 sites took 145 seconds per cycle. The current async version does the same in 8 seconds (18× faster on the same hardware).
The core is aiohttp with a tuned TCPConnector:
```python
import aiohttp

self.connector = aiohttp.TCPConnector(
    limit=100,                  # max total open connections
    limit_per_host=10,          # max per host — prevents single-host bottleneck
    ttl_dns_cache=300,          # 5-minute DNS cache
    enable_cleanup_closed=True,
)
self.session = aiohttp.ClientSession(connector=self.connector)
```
Three knobs that matter:
- `limit_per_host=10` is the one most people forget. Without it, scanning a slow site with 50 pages will block your entire pool.
- `ttl_dns_cache=300` saves a DNS round-trip on every request. For monitoring, where you hit the same hosts on a loop, this is free latency.
- `enable_cleanup_closed=True` prevents file-descriptor leaks under sustained load — important for a long-running daemon.
Discovery is BFS-bounded:
```python
@retry_on_network_error(max_attempts=2)
async def _fetch_and_parse_links(self, url: str) -> set[str]:
    async with self._semaphore:  # bounds in-flight requests
        html = await self.http_client.get_html(url)
        if self.delay_between_requests > 0:
            await asyncio.sleep(self.delay_between_requests)
        return self._extract_internal_links(html, url)
```
A semaphore (concurrent_requests=5 by default) caps in-flight requests, and a 0.1s polite delay keeps us off WAFs' bad side. The crawler stops at max_depth=2 and max_pages=100 per site — enough to catch the homepage and immediate deep links without spidering forever.
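Stripped of the async machinery, the bounding logic is plain breadth-first search. A simplified synchronous sketch, with `fetch_links` standing in for the fetch-and-parse step:

```python
from collections import deque

def discover_pages(start_url, fetch_links, max_depth=2, max_pages=100):
    """BFS over internal links, bounded by depth and total page count."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue  # at the depth bound: keep the page, don't expand its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)       # dedupe before enqueueing
                queue.append((link, depth + 1))
    return pages
```

BFS order matters here: the homepage and its direct links are always scanned before anything deeper, so even when `max_pages` cuts the crawl short, the most defacement-prone pages are covered first.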
## The UI is decoupled from scanning — and that matters
The PyQt6 dashboard runs on Qt's main thread. The async crawler and detection workers run on other threads. They communicate exclusively through Qt signals:
```python
class AnalysisWorker(QThread):
    analysis_completed = pyqtSignal(dict)      # result payload
    alert_logged = pyqtSignal(str, str, dict)

    def run(self):
        for site in self.sites:
            result = self.engine.run_layers(site)
            self.analysis_completed.emit(result)
            if result["should_alert"]:
                self.alert_logged.emit(result["alert_type"], site.url, result["evidence"])
```
The dashboard connects to those signals once at startup:
```python
worker.analysis_completed.connect(dashboard.update_kpis)
worker.alert_logged.connect(alert_manager.send_alert)
```
The benefit: every scan cycle pushes incremental updates to the UI as soon as a site is done. KPI cards tick up in real time, the table fills row by row. No "loading..." overlay, no UI freezes — even when a single site is timing out for 30 seconds.
## The alerting pipeline: throttle, capture, dispatch
A defacement alert without evidence is useless. WatchTower's alert manager runs three steps every time it fires:
- Throttle check — same site, same alert type within the last 15 minutes? Suppress (`throttle_minutes` in the config). Without this, a flapping site would page you 50 times an hour.
- Evidence capture — the screenshot, the rendered HTML, and the visible text are saved to `evidence/{domain}/{timestamp}_{alert_type}/`. This becomes the audit trail when you need to explain the incident.
- Dispatch — Telegram (bot API with photo + Markdown caption), SMTP (HTML email with base64 screenshot), and Discord-compatible webhooks. Each channel is independent; one failing doesn't block the others.
```python
def send_alert(self, alert_type: str, site_url: str, evidence: dict):
    if not self._should_trigger_alert(alert_type, site_url):
        return  # throttled

    path = self.evidence_manager.save_alert_evidence(
        url=site_url,
        alert_type=alert_type,
        screenshot=evidence["screenshot"],
        html_content=evidence["html"],
        text_content=evidence["text"],
    )

    for channel in self.enabled_channels:
        try:
            channel.send(alert_type, site_url, evidence, path)
        except Exception:
            log.exception("alert_channel_failed", channel=channel.name)
```
The try/except per channel is deliberate. Telegram going down should not eat the email; email going down should not silence the webhook.
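The throttle check itself is mostly a timestamp map. A sketch of the idea — the class name is mine, and WatchTower reads the window from its `throttle_minutes` config:

```python
from datetime import datetime, timedelta

class AlertThrottle:
    """Suppress repeat alerts for the same (site, alert_type) within a window."""

    def __init__(self, throttle_minutes: int = 15):
        self.window = timedelta(minutes=throttle_minutes)
        self._last_fired: dict[tuple[str, str], datetime] = {}

    def should_trigger(self, alert_type: str, site_url: str, now=None) -> bool:
        now = now or datetime.now()  # injectable clock makes this testable
        key = (site_url, alert_type)
        last = self._last_fired.get(key)
        if last is not None and now - last < self.window:
            return False  # same alert fired too recently: suppress
        self._last_fired[key] = now
        return True
```

Keying on the (site, type) pair is what lets a new, different alert type for the same site still get through while a flapping duplicate is silenced.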
## Reputation enrichment
The same pipeline is used for IoC enrichment. When a new external host is found in a page (a script src, an image, an iframe), WatchTower checks it against:
- A local IoC file (`assets/ioc.txt`) — instant, offline
- VirusTotal v3 (rate-limited to 4 req/min — the free tier ceiling)
- AbuseIPDB v2 (1 req/sec, confidence threshold 50%)
Each external API has a session-local cache and a kill-switch on consecutive failures. The principle is the same as for Gemini: the monitor must keep monitoring even if the enrichment APIs vanish.
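The layered lookup — local file first, then remote APIs, with a session cache and a kill-switch — can be sketched like this. `query_remote` stands in for the rate-limited VirusTotal/AbuseIPDB calls, and the class is illustrative, not WatchTower's actual code:

```python
class ReputationChecker:
    """Layered host reputation: local IoC list, then cached remote lookups."""

    def __init__(self, local_iocs: set[str], query_remote):
        self.local_iocs = local_iocs        # offline list, e.g. loaded from a file
        self.query_remote = query_remote    # stand-in for the remote API calls
        self.cache: dict[str, bool] = {}    # session-local: host -> is_malicious
        self.remote_enabled = True
        self.consecutive_failures = 0

    def is_malicious(self, host: str) -> bool:
        if host in self.local_iocs:
            return True  # instant, offline hit: no network needed
        if host in self.cache:
            return self.cache[host]  # each host queried at most once per session
        if not self.remote_enabled:
            return False  # degraded mode: local checks only
        try:
            verdict = bool(self.query_remote(host))
            self.consecutive_failures = 0
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= 5:
                self.remote_enabled = False  # kill-switch: stop hitting the API
            return False
        self.cache[host] = verdict
        return verdict
```

Treating "remote API unavailable" as "not known-malicious" is the same graceful-degradation choice as the Gemini layer: the monitor keeps monitoring, just with less enrichment.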
## What's next
Two things are on the roadmap:
- Local model replacing Gemini. The training pipeline is in progress; the goal is a small image+text classifier (~50MB, CPU-friendly) that runs entirely offline. Privacy and cost both improve, and the prompt-engineering brittleness goes away.
- Full async monitoring controller. Right now the controller still has some sync scaffolding — converting it to an async event loop would let the same process scan thousands of sites instead of hundreds.
I also want to expose a small REST API so a Grafana dashboard can pull KPIs directly, and ship a Docker image for headless deployments.
## Takeaways
If you're building anything in the "watch this thing for changes" space — defacement, content drift, dependency hijack, anything — three patterns from WatchTower are worth stealing:
- Stack detection layers, don't pick one. Each layer's blind spots are covered by the next, and combining cheap-to-expensive lets you short-circuit when nothing changed.
- Bound everything async — semaphores, per-host connection limits, retry caps, kill-switches on failing APIs. A monitor that DDoS's its own targets or gets stuck on a single bad host is worse than no monitor.
- Decouple the UI from the work — Qt signals across threads, or queues across processes, anything but blocking the loop your humans look at.
Source code, issues, and roadmap: github.com/hi3ris. More projects at hi3ris.blueshield.tg.
If you've shipped something similar — or have war stories about the perfect detection threshold — I want to hear it in the comments.