By RUGERO Tesla (@404Saint).
It started with an article I couldn't stop thinking about
A few months back I read about how attackers were poisoning search results to push malicious software downloads. The attack isn't sophisticated. You register a convincing-looking domain, keyword-stuff it correctly, buy or manipulate your way into the top results, and wait. Someone searches "Siemens TIA Portal V17 download", clicks the third result, and downloads a trojanised installer.
What got me wasn't that it worked. It was how it worked. People trust search results. Not because they've verified them. Just because they're there.
And the thing is, most people only check one search engine.
That thought wouldn't leave me alone. If an attacker has to poison Google AND Bing AND Brave AND DuckDuckGo simultaneously for the same query at comparable rank positions... that's a much harder problem. Cross-referencing results across engines should make poisoned results stick out.
So one slow weekend I started building something. I called it Arkoi.
The question I wanted to answer
Every URL scanner I know of asks: is this URL dangerous?
I wanted to ask something different: given that I searched for X, does this result actually belong here?
That sounds subtle but it changes a lot. A two-year-old domain with a clean URLhaus record can still be a poisoned result if it's ranking #2 on Google for a specific enterprise software query while being completely absent everywhere else. The domain isn't inherently dangerous. It's contextually wrong. That's the signal.
How it actually works
Parsing the query first
Before fetching anything, Arkoi tries to understand what you're actually looking for. It pulls out the vendor, the software name, and the version from raw text.
So "Siemens TIA Portal V17 download" becomes:
```
vendor  : siemens
version : V17
tokens  : ['siemens', 'tia', 'portal', 'v17']
```
It also handles product aliases. Search for "autocad" and it maps to Autodesk's vendor profile. "matlab" maps to MathWorks. "pycharm" maps to JetBrains. You don't need to know who makes what.
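A minimal sketch of what that parsing step might look like. `KNOWN_VENDORS` and `ALIASES` here are illustrative stand-ins for Arkoi's real registry of 50+ vendor profiles, and the real parser also drops generic terms like "download" from the token list:

```python
import re

# Illustrative registry fragments, not Arkoi's actual data.
KNOWN_VENDORS = {"siemens", "autodesk", "mathworks", "jetbrains"}
ALIASES = {"autocad": "autodesk", "matlab": "mathworks", "pycharm": "jetbrains"}

def parse_query(query: str) -> dict:
    tokens = re.findall(r"[a-z0-9.]+", query.lower())
    vendor = None
    for t in tokens:
        if t in KNOWN_VENDORS:
            vendor = t
            break
        if t in ALIASES:  # product alias: map "autocad" to its vendor
            vendor = ALIASES[t]
            break
    # A version token looks like "v17", "2025", or "3.1.2"
    version = next((t for t in tokens if re.fullmatch(r"v?\d+(\.\d+)*", t)), None)
    return {"vendor": vendor, "version": version, "tokens": tokens}
```

The alias lookup is why you don't need to know who makes what: the product name alone is enough to pull in the right vendor profile.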
Fetching six engines at once
All six engines (Google, Bing, Brave, DuckDuckGo, Yahoo, Yandex) get queried in parallel through a self-hosted SearXNG instance. Results come back merged and deduplicated by domain, with each result carrying a record of which engines returned it and at what rank.
```python
import asyncio
import aiohttp

async def fetch_all(query: str) -> tuple[list[SearchResult], int]:
    async with aiohttp.ClientSession() as session:
        # Fire all engine requests concurrently through SearXNG
        tasks = [_fetch_engine(session, eng, query) for eng in ENGINES]
        results_per_engine = await asyncio.gather(*tasks)
    # Count engines that actually returned results: the consensus denominator
    responded = sum(1 for r in results_per_engine if r)
    raw = [item for engine_results in results_per_engine for item in engine_results]
    return _merge_results(raw), max(responded, 1)
```
The number of engines that actually responded matters because it's the denominator for consensus scoring. If only three engines respond, a result appearing on two of them is medium consensus, not low.
Six signal checks per result, all concurrent
Vendor domain verification. Does this domain actually belong to the vendor you searched for? There are four possible outcomes: VENDOR_MATCH (it's them), TRUSTED_PARTNER (it's a safe subdomain or official partner), VENDOR_IMPOSTER (the domain contains the vendor name but isn't theirs, like siemens-downloads.net), and UNRELATED.
The imposter case is the most dangerous one and the easiest to catch.
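A sketch of the four-way classification. The `vendor_root` and `partners` values would come from the matched vendor profile; everything below is illustrative, not Arkoi's real registry logic:

```python
def classify_domain(domain: str, vendor: str, vendor_root: str,
                    partners: set[str]) -> str:
    """Four-way outcome for a result domain, given the vendor parsed from the query."""
    # Exact match or any subdomain of the official root
    if domain == vendor_root or domain.endswith("." + vendor_root):
        return "VENDOR_MATCH"
    if domain in partners:
        return "TRUSTED_PARTNER"
    # Vendor name embedded in a domain the vendor doesn't own
    if vendor in domain:
        return "VENDOR_IMPOSTER"
    return "UNRELATED"
```

The imposter check is cheap precisely because the attacker's lure (the vendor name in the domain) is the thing that gives them away.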
Cross-engine consensus. What share of responding engines returned this domain? 60% or above is high consensus. Below 33% is low. A result that only shows up on one engine for a well-known software query is already worth questioning.
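With the responding-engine count as denominator (from `fetch_all` above), the consensus bucketing is a few lines. A sketch using the thresholds just described:

```python
def consensus_level(engines_seen: int, engines_responded: int) -> str:
    """Classify consensus by the share of *responding* engines that
    returned the domain, so a partial SearXNG config degrades gracefully."""
    share = engines_seen / max(engines_responded, 1)
    if share >= 0.60:
        return "high"
    if share < 0.33:
        return "low"
    return "medium"
```

This is why the denominator matters: 2 of 3 responding engines is high consensus, while the same 2 engines out of 6 is not.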
Rank anomaly. Is an unrelated domain sitting in the top 3? Is the official vendor domain buried past position 5 while other domains outrank it? Either pattern is a flag.
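Both patterns can be sketched over merged (domain, best-rank) pairs. This is a simplification under the assumption that ranks are already merged across engines; the hypothetical `rank_flags` helper is mine, not Arkoi's:

```python
def rank_flags(ranked: list[tuple[str, int]], vendor_root: str) -> list[str]:
    """ranked: (domain, best rank) pairs. Flags the two anomaly patterns."""
    def is_vendor(d: str) -> bool:
        return d == vendor_root or d.endswith("." + vendor_root)

    flags = []
    top3 = [d for d, r in ranked if r <= 3]
    if top3 and not any(is_vendor(d) for d in top3):
        flags.append("no vendor domain in top 3")
    vendor_rank = next((r for d, r in ranked if is_vendor(d)), None)
    if vendor_rank is not None and vendor_rank > 5:
        flags.append(f"official domain buried at rank {vendor_rank}")
    return flags
```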
Query-result relevance. Token overlap, keyword stuffing detection, and URL path analysis. If the path contains things like /full-version/, /googledrive/, /crack/, that's a direct signal. Known platforms like YouTube and Reddit are excluded from the stuffing check because their titles naturally repeat search terms and that's just how they work.
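The path check is the most direct of these signals. A minimal sketch using only the segments named above (the real list is presumably longer):

```python
from urllib.parse import urlparse

# Path segments called out in the article; illustrative, not exhaustive.
SUSPICIOUS_SEGMENTS = {"full-version", "googledrive", "crack"}

def path_signals(url: str) -> list[str]:
    """Return any suspicious segments found in the URL path."""
    path = urlparse(url).path.lower()
    segments = {s for s in path.strip("/").split("/") if s}
    return sorted(segments & SUSPICIOUS_SEGMENTS)
```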
URLhaus lookup. Async check against the abuse.ch database. If the domain is a known malware host, that surfaces immediately regardless of everything else.
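URLhaus exposes a host-lookup endpoint that takes a `host` parameter and returns JSON with a `query_status` field. A synchronous stand-in for Arkoi's async check (the interpretation of the response shape is my reading of the API, so treat it as a sketch):

```python
import json
import urllib.parse
import urllib.request

URLHAUS_HOST_API = "https://urlhaus-api.abuse.ch/v1/host/"

def is_known_malware_host(payload: dict) -> bool:
    """Interpret a URLhaus host-lookup response: query_status "ok" plus
    at least one URL entry means the domain is a known malware host."""
    return payload.get("query_status") == "ok" and bool(payload.get("urls"))

def check_urlhaus(domain: str, timeout: float = 6.0) -> bool:
    data = urllib.parse.urlencode({"host": domain}).encode()
    with urllib.request.urlopen(URLHAUS_HOST_API, data=data, timeout=timeout) as resp:
        return is_known_malware_host(json.load(resp))
```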
Domain age. WHOIS with a hard 6-second timeout. The timeout matters because without it, stalled WHOIS connections hold up the entire pipeline. Only domains under 180 days get flagged. Older domains get no age penalty regardless of anything else.
Verdicts, not scores
This is the part I'm most opinionated about. No percentage scores. Four categories with explicit reasons:
| Verdict | What it means |
|---|---|
| ✓ TRUSTED | Official vendor or trusted partner, consistent across engines |
| ? UNVERIFIED | No red flags, but no vendor relationship confirmed either |
| ⚠ SUSPICIOUS | Something's off: new domain, rank anomaly, suspicious path |
| ✗ DECEPTIVE | Clear indicators of deceptive placement |
The UNVERIFIED state was the most important one to get right. An earlier version showed anything without red flags as green. That's not safe, that's just uninspected. "We found nothing wrong" and "this is safe" are different things.
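The decision logic can be sketched as a short cascade where UNVERIFIED is the fall-through, not TRUSTED. The exact signal-to-verdict mapping below is my guess at a plausible ordering, not Arkoi's actual rules:

```python
def verdict(vendor_status: str, flags: list[str]) -> str:
    """Map signals to the four categories; UNVERIFIED is the default."""
    if "known_malware_host" in flags or vendor_status == "VENDOR_IMPOSTER":
        return "DECEPTIVE"
    if flags:
        return "SUSPICIOUS"
    if vendor_status in {"VENDOR_MATCH", "TRUSTED_PARTNER"}:
        return "TRUSTED"
    return "UNVERIFIED"  # "we found nothing wrong" != "this is safe"
```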
The stuff I got wrong
The first version had numeric percentage scores, SSL certificate issuer checking, and keyword scoring. All three were mistakes.
Percentage scores sounded precise but weren't. Where does 67% come from? Arbitrary thresholds added together. Replacing scores with categorical verdicts plus explicit reasoning is more honest and actually more useful because you can see why something got flagged.
SSL issuer checking was noise. In 2025, penalising a domain for using Let's Encrypt tells you it's cost-conscious, not malicious. Millions of legitimate sites use DV certs. Dropped entirely.
Keyword scoring fired too broadly. "Free download" catches CNET. "Full version" catches vendor trial pages. The signal-to-noise ratio was terrible. Replaced with vendor domain mismatch detection and URL path analysis, which are actually precise.
The biggest practical problem was speed. Everything ran sequentially in the first version. Twelve results times three slow network checks each meant runs taking close to two minutes. Rewriting with asyncio and running all per-result checks concurrently got this to around 9 seconds.
Why SearXNG
Arkoi requires a self-hosted SearXNG instance. That's a real dependency and worth explaining.
Scraping search engines directly is legally grey and technically fragile. Official APIs are rate-limited, paid, and different for every engine. SearXNG handles all of this cleanly. One local endpoint, six engines, no API keys, privacy-preserving by default.
```shell
docker run -d -p 8080:8080 searxng/searxng
```
The downside is that not all SearXNG configs have all engines enabled out of the box. In my testing only 3 of 6 engines consistently responded. The consensus logic adapts to however many engines actually returned results so it degrades gracefully.
Where it still falls short
WHOIS age is less useful than I hoped. Privacy protection and rate limiting mean most domains come back as UNKNOWN rather than an actual age. Age works as a supporting signal when it's available but you can't rely on it.
Yandex skews rank anomaly detection. Yandex's ordering for Western software queries is genuinely different from other engines. A YouTube tutorial ranked #1 by Yandex isn't poisoning, it's just Yandex. The rank anomaly check needs engine-aware weighting to handle this properly.
No vendor match means less precision. If your query doesn't hit any of the 50+ vendor profiles, vendor verification gets skipped and you're left with consensus and anomaly scoring only. Still useful, but clearly a step down.
Try it
```shell
git clone https://github.com/404saint/arkoi.git
cd arkoi
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Start SearXNG
docker run -d -p 8080:8080 searxng/searxng

# Run it
python arkoi.py "AutoCAD 2025 download"
python arkoi.py "Wireshark install"
python arkoi.py "Adobe Photoshop free download"
```
Tagged v0.1.0-alpha. Pre-release, not production ready. Known issues are in the GitHub tracker. The README and CONTRIBUTING docs cover everything you'd need to add a vendor or pick up an open issue.
⭐ GitHub
If this was useful or interesting, a star helps other people find it. Contributions welcome, especially vendor registry additions and the missing test suite. Open a PR and the CONTRIBUTING guide will walk you through it.
Built by RUGERO Tesla · GitHub: @404Saint
Started as a bored weekend experiment. Turned out to be a more interesting problem than I expected.