
NOX Project
Why I built an async recursive engine for OSINT (and why I nearly went insane doing it)

A few months ago, I found myself stuck in a massive OSINT rabbit hole. The routine was always the same: find an email, check a breach database, find a handle, search that handle elsewhere, find a hash, try to crack it... and repeat for three hours.

I realized I wasn't doing "investigation" anymore; I was just acting like a manual script. My brain was melting. So, I decided to automate that "avalanche effect" and ended up building NOX.

The Core Idea: The "Avalanche" Effect
Most OSINT tools are linear. You give them a seed, they give you a list.

I wanted NOX to be recursive. In the nox.py core, I implemented what I call Recursive Reinjection. Every time the engine discovers a new unique asset—a different email, a specific handle, or a high-fidelity hash—it doesn't just log it. It automatically re-injects it as a new search seed.

It maps out the entire "identity blast radius" in seconds. But, as you can imagine, managing recursion depth so you don't accidentally try to map the entire internet starting from a handle like "admin" was... interesting.
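To make the reinjection idea concrete, here's a minimal sketch of the avalanche loop with a hard depth cap. Everything here is illustrative: `search_sources`, `FAKE_GRAPH`, and `avalanche` are toy stand-ins, not the actual `nox.py` internals.

```python
import asyncio

# Toy stand-in for NOX's real source modules: asset -> newly discovered assets
FAKE_GRAPH = {
    "alice@example.com": {"alice_h", "deadbeefhash"},
    "alice_h": {"alice@example.com", "alice.h@example.org"},
}

async def search_sources(asset: str) -> set[str]:
    # Real code would fan out to APIs/scrapers here
    return FAKE_GRAPH.get(asset, set())

async def avalanche(seed: str, max_depth: int = 3) -> set[str]:
    explored: set[str] = set()
    frontier = {seed}
    for _ in range(max_depth):              # hard depth cap: stops "admin"-style blowups
        new_assets: set[str] = set()
        for asset in frontier:
            explored.add(asset)
            new_assets |= await search_sources(asset)
        frontier = new_assets - explored    # only unseen assets get reinjected
        if not frontier:                    # nothing new discovered: avalanche has settled
            break
    return explored
```

The key design point is the `new_assets - explored` set difference: that's what keeps the recursion from chasing its own tail when two assets point back at each other.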

Why Asyncio?
Interrogating 120+ sources (APIs, scrapers, search dorks) synchronously is a suicide mission for performance. I went with Python/Asyncio because I needed to handle hundreds of concurrent requests without the overhead of a massive thread pool.

```python
# The simplified logic behind the pivot
if discovered_assets:
    for asset in discovered_assets:
        if asset not in self.explored_set:
            self.explored_set.add(asset)  # mark before pivoting so we never loop
            await self.pivot_search(asset)
```
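The concurrent fan-out across sources looks roughly like this. This is a hedged sketch: `SOURCES`, `query_source`, and the semaphore limit are illustrative stand-ins for NOX's real module interface, and the `asyncio.sleep(0)` is a placeholder for the actual HTTP call.

```python
import asyncio

# Illustrative list of source modules (the real project interrogates 120+)
SOURCES = [f"source_{i}" for i in range(120)]

async def query_source(source: str, asset: str, sem: asyncio.Semaphore) -> str:
    async with sem:                # cap in-flight requests to avoid hammering hosts
        await asyncio.sleep(0)     # placeholder: real code awaits an HTTP client here
        return f"{source}:{asset}"

async def fan_out(asset: str, limit: int = 50) -> list[str]:
    sem = asyncio.Semaphore(limit)
    tasks = [query_source(s, asset, sem) for s in SOURCES]
    return await asyncio.gather(*tasks)   # preserves source order in the results
```

A semaphore like this is the usual middle ground: you get hundreds of sockets in flight without the memory cost of one thread per request.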

The speed is great, but speed is useless if you get blocked by the first firewall you hit.

The WAF War: JA3 and TLS Fingerprinting
This was the biggest "hidden" cost of the project. Modern WAFs like Cloudflare and Akamai are incredibly good at spotting standard Python libraries. Even if you rotate User-Agents, they’ll catch you during the TLS handshake (JA3 fingerprinting).

To keep NOX alive, I had to implement randomized TLS signatures and custom headers to mimic real browser behavior. It’s a constant game of cat-and-mouse. Every time a major source moves the goalposts, I’m back in the code updating fingerprints.
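For context on what "randomized TLS signatures" means: a JA3 fingerprint hashes fields of the ClientHello, including the cipher suite order. One knob you can turn from the Python standard library alone is shuffling the TLS 1.2 cipher order, sketched below. This is purely illustrative of the idea; real evasion needs more than this (extension order, TLS 1.3 suites, HTTP/2 settings), and it is not NOX's actual implementation.

```python
import random
import ssl

def randomized_tls_context() -> ssl.SSLContext:
    """Build an SSLContext with a shuffled TLS 1.2 cipher order (illustrative)."""
    ctx = ssl.create_default_context()
    # set_ciphers() only controls TLS 1.2-and-below suites in OpenSSL;
    # TLS 1.3 suites are fixed separately and unaffected here.
    tls12 = [c["name"] for c in ctx.get_ciphers() if c["protocol"] == "TLSv1.2"]
    if tls12:
        random.shuffle(tls12)              # vary the cipher-order component of JA3
        ctx.set_ciphers(":".join(tls12))
    return ctx
```

In practice most people reach for a library that impersonates a real browser's full handshake rather than hand-rolling this, but the sketch shows which layer the cat-and-mouse game is played at.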

Solving the "Signal vs. Noise" Problem
OSINT generates a lot of garbage. If a leak is from 2012, how much do I actually care about that password?

I added risk_score logic that weights results based on:

- Recency: how old is the leak?
- Uniqueness: a bcrypt hash or a unique email is a "high-fidelity bridge," while a common handle is just noise.
- Source reliability: not all databases are created equal.
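The weighting idea can be sketched like this. The field names, weight tables, and decay formula below are all illustrative assumptions, not the actual risk_score code in the repo.

```python
from datetime import date

# Illustrative weight tables (not NOX's real values)
SOURCE_WEIGHT = {"verified_breach": 1.0, "paste_site": 0.5, "forum_scrape": 0.3}
FIDELITY = {"bcrypt_hash": 1.0, "email": 0.8, "handle": 0.2}

def risk_score(asset_type: str, leak_year: int, source: str) -> float:
    """Score an asset by recency, uniqueness, and source reliability (sketch)."""
    # Linear decay: a leak loses all recency weight after 10 years
    recency = max(0.0, 1.0 - (date.today().year - leak_year) / 10)
    return round(
        recency * FIDELITY.get(asset_type, 0.1) * SOURCE_WEIGHT.get(source, 0.3),
        3,
    )
```

With multiplicative weights like this, a 2012 password from a paste site scores near zero while a fresh bcrypt hash from a verified breach floats to the top of the graph.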

Why Open Source?
Honestly? Because maintaining 120+ scrapers alone is impossible. Sites change their DOM, APIs move to v2, and WAF rules evolve every week.

I’m hoping the community can help me keep the "signatures" and scrapers updated while I focus on improving the relational graphing and the HVT (High Value Target) detection logic.

If you’re into Red Teaming, Bug Bounty, or just like seeing how far you can push Asyncio, give it a spin. The code is probably a bit "cursed" in some places, but it’s saved me a ton of time on initial recon.

Check the repo here: https://github.com/nox-project/nox-framework

I’m curious—how are you guys handling the noise-to-signal ratio when you’re dealing with massive relational datasets like this? Do you prefer manual pruning or do you trust the automated scoring?
