TLDR: I built an open source AI agent that runs OSINT investigations from your terminal. The interesting part wasn't the OSINT — it was figuring out why every approach I tried kept hallucinating security data, and how I fixed it using the Anthropic tool use API.
I'm Tommaso Bertocchi, a developer and open source creator. I also maintain Pompelmi, a file upload security scanner for Node.js with 600+ GitHub stars.
Here is a real output I got from an AI OSINT tool six months ago:
    [+] Twitter: @targethandle
    [+] GitHub: https://github.com/megadose/holehe
    [+] IP Address: 80.249.165.118
    [+] SSH Banner: SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.3
    [+] Organization: Unnamed Organization (United States)
Every single line was invented.
The Twitter handle didn't belong to the target. The GitHub link pointed to holehe's own repository (the tool itself), not to any account of the target. The IP, the SSH banner, the organization: pure hallucination, formatted to look exactly right.
This is the problem with combining LLMs and security tooling naively. Models are trained to produce plausible-looking output. Security data is highly structured and pattern-consistent. So when a model doesn't know something, it invents something that looks exactly right — and that's far more dangerous than an obvious error.
I spent three months building, breaking, and rebuilding an approach that actually works. This is what I learned.
The Wrong Way: Manual ReAct Loop
The obvious first approach is a ReAct (Reasoning + Acting) loop. You prompt the model to output JSON when it wants to call a tool, parse it, execute the tool, feed results back, repeat.
    # The naive approach
    response = llm.chat(messages)
    if "tool_call" in response:
        tool_name = parse_json(response)["tool"]
        result = run_tool(tool_name)
        messages.append(result)
        response = llm.chat(messages)
The problem: the model writes the tool call and then keeps going, "simulating" what the tool would return in the same completion. By the time you feed real results back, the model has already committed to a narrative. It reconciles the real output with its hallucinated expectations rather than updating cleanly.
I tried every prompt engineering fix:
- "NEVER invent results" — ignored
- "Copy tool output VERBATIM" — model still reworded and added context
- "If you have no data, say 'No results found'" — model said "No results found" then listed fake results anyway
The model was roleplaying an OSINT analyst, not doing the work of one.
The Right Way: Native Tool Use API
The Anthropic tool use API changes the architecture fundamentally.
Instead of asking the model to generate tool calls as text, you define tools as structured schemas. The model returns stop_reason: "tool_use" — a hard stop. Your code executes the actual tool. The real output goes back as a tool_result block. The model continues.
    def run_agent(client, messages):
        # Wrapped in a function so the return statement below is valid
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                system=SYSTEM_PROMPT,
                tools=tool_definitions,
                messages=messages
            )
            # Model is done (or stopped for another reason, such as max_tokens):
            # return the text blocks instead of looping forever
            if response.stop_reason != "tool_use":
                return "".join(b.text for b in response.content if b.type == "text")
            # Model wants a tool: execute it for real
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    # This is the critical part: real execution, real output
                    result = TOOL_MAP[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
The model never gets a chance to simulate results because it hits a hard stop before generating them. It receives real output before continuing. Hallucination becomes structurally impossible in the tool output path.
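One practical gap in the loop above: if a tool raises (a timeout, a missing binary, an argument the model got wrong), the exception ends the whole investigation. Below is a minimal sketch of a safer execution step. The helper name `execute_tool_call` is hypothetical and not taken from OpenOSINT; it relies on the `is_error` field that `tool_result` blocks accept, so the model is told about a real failure instead of being left to guess.

```python
from typing import Any, Callable, Dict


def execute_tool_call(
    name: str,
    tool_input: Dict[str, Any],
    tool_use_id: str,
    tool_map: Dict[str, Callable[..., str]],
) -> Dict[str, Any]:
    """Run one requested tool and wrap the outcome as a tool_result block."""
    try:
        if name not in tool_map:
            raise LookupError(f"unknown tool: {name}")
        # Real execution: the content is whatever the tool actually returned
        content = str(tool_map[name](**tool_input))
        is_error = False
    except Exception as exc:
        # Report the failure to the model rather than crashing the loop
        content = f"Tool failed: {exc}"
        is_error = True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": content,
        "is_error": is_error,
    }
```

Inside the loop, the body of the `for` block would then reduce to `tool_results.append(execute_tool_call(block.name, block.input, block.id, TOOL_MAP))`.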
The Architecture: OpenOSINT
I built this into OpenOSINT — an open source AI OSINT agent for the terminal.
Three layers, cleanly separated:
1. Provider layer — abstracts the LLM. Same interface for Anthropic, OpenAI, and Ollama:
    from abc import ABC, abstractmethod

    class BaseProvider(ABC):
        @abstractmethod
        def chat(self, messages: list, system: str, tools: list) -> "ProviderResponse":
            # ProviderResponse is defined elsewhere in the project
            pass
2. Tool registry — OSINT tools registered via decorator:
    import subprocess

    @register_tool(
        name="search_email",
        description="Find social accounts linked to an email using holehe.",
        parameters={
            "type": "object",
            "properties": {
                "email": {"type": "string"}
            },
            "required": ["email"]
        }
    )
    def search_email(email: str) -> str:
        # Run the real holehe CLI; nothing here is generated by the model
        result = subprocess.run(["holehe", email], capture_output=True, text=True, timeout=60)
        # Keep only the lines holehe marks with [+]
        found = [line.strip() for line in result.stdout.splitlines() if "[+]" in line]
        return "Found:\n" + "\n".join(found) if found else "No accounts found."
3. Agent loop — the ReAct loop using native tool use, described above.
Adding a new tool is one file + one decorator. Nothing else to touch.
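The decorator itself is not shown in this post, so the following is only a sketch of how such a registry could work, not OpenOSINT's actual implementation. It assumes two module-level containers: `TOOL_MAP` (the name used in the agent loop earlier) and a hypothetical `TOOL_DEFINITIONS` list that stores schemas in the shape the Anthropic API expects, with the JSON Schema under `input_schema`. `echo_target` is a made-up demo tool.

```python
from typing import Any, Callable, Dict, List

# Assumed containers: name -> callable, plus the schemas sent to the model
TOOL_MAP: Dict[str, Callable[..., str]] = {}
TOOL_DEFINITIONS: List[Dict[str, Any]] = []


def register_tool(name: str, description: str, parameters: Dict[str, Any]):
    """Register a function as a tool and record its schema."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        if name in TOOL_MAP:
            raise ValueError(f"tool already registered: {name}")
        TOOL_MAP[name] = func
        TOOL_DEFINITIONS.append({
            "name": name,
            "description": description,
            "input_schema": parameters,  # Anthropic's key for the JSON Schema
        })
        return func  # the function stays directly callable
    return decorator


@register_tool(
    name="echo_target",
    description="Hypothetical demo tool that echoes its input.",
    parameters={
        "type": "object",
        "properties": {"target": {"type": "string"}},
        "required": ["target"],
    },
)
def echo_target(target: str) -> str:
    return f"target={target}"
```

With this shape, importing a tool module is enough to make the tool visible to the agent, which is what keeps "one file + one decorator" true.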
The Tools
| Tool | Wraps | What it investigates |
|---|---|---|
| search_email | holehe | Social accounts linked to an email |
| search_username | sherlock | 300+ platforms by username |
| search_domain | sublist3r | Subdomain enumeration |
| search_breach | HaveIBeenPwned API | Data breach exposure |
| search_whois | python-whois | Domain registrant info |
| search_ip | ipinfo.io | Geolocation, ASN, hostname |
| generate_dorks | built-in | Google dork URL generation |
| search_paste | psbdmp | Pastebin dump mentions |
| search_phone | phoneinfoga | Carrier, country, line type |
Why the Agent Approach Beats a Fixed Pipeline
The alternative to an agent is a hardcoded workflow: always run holehe, then sherlock, then HIBP. Simple, predictable, debuggable.
The problem: different targets need different workflows.
An email address → holehe + breach check makes sense. A domain → WHOIS + sublist3r makes sense. A person's name with no other identifiers → generate dorks first to discover real usernames, then run Sherlock on those. Running Sherlock on "John Doe" directly is useless.
The agent figures this out. It reads what generate_dorks returns, finds a mention of @johndoe_dev on GitHub in the dork output, and runs search_username("johndoe_dev") — not search_username("John Doe").
This is the actual value of the agent: not automation, but contextual decision-making. Each step informs the next.
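For contrast, here is roughly the most a fixed pipeline can do: branch on the target type and run a hardcoded list of tools. This is an illustrative sketch, not code from OpenOSINT. `classify_target`, `plan_for`, and the `PIPELINES` table are hypothetical names, and the regexes are deliberately loose.

```python
import ipaddress
import re
from typing import Dict, List

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)

# Hardcoded workflows, using the tool names from the table above
PIPELINES: Dict[str, List[str]] = {
    "email": ["search_email", "search_breach"],
    "domain": ["search_whois", "search_domain"],
    "ip": ["search_ip"],
    "name": ["generate_dorks"],
}


def classify_target(target: str) -> str:
    """Guess the target type from its shape alone."""
    target = target.strip()
    try:
        ipaddress.ip_address(target)
        return "ip"
    except ValueError:
        pass  # not an IP address, keep checking
    if EMAIL_RE.match(target):
        return "email"
    if DOMAIN_RE.match(target):
        return "domain"
    return "name"  # fallback: treat anything else as a free-text name


def plan_for(target: str) -> List[str]:
    """Return the fixed tool sequence for this target."""
    return PIPELINES[classify_target(target)]
```

Note what the `name` branch cannot express: the next step depends on what `generate_dorks` actually returns, and that is the part the agent handles.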
Multi-Provider Support
The provider abstraction means you can swap the LLM without touching anything else:
    # config.yaml
    provider: anthropic # or: openai, ollama
    model: claude-sonnet-4-20250514
    api_key: sk-ant-...
OpenAI's function calling works on the same principle as Anthropic's tool use, so it slots in cleanly. Ollama (local models) is marked experimental — local models handle structured tool-calling less consistently, but it works for basic investigations without sending data to any API.
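"Same principle" still means slightly different wire formats. As a hedged sketch of what a provider adapter has to do, the function below converts Anthropic-style tool definitions (`name`, `description`, `input_schema`) into the shape OpenAI's Chat Completions API takes for function tools. `to_openai_tools` is a hypothetical helper name, and OpenOSINT's real provider layer may handle this differently.

```python
from typing import Any, Dict, List


def to_openai_tools(anthropic_tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert Anthropic-style tool definitions to OpenAI function tools."""
    converted = []
    for tool in anthropic_tools:
        converted.append({
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool.get("description", ""),
                # Same JSON Schema, different key: input_schema -> parameters
                "parameters": tool["input_schema"],
            },
        })
    return converted


example = [{
    "name": "search_email",
    "description": "Find social accounts linked to an email using holehe.",
    "input_schema": {
        "type": "object",
        "properties": {"email": {"type": "string"}},
        "required": ["email"],
    },
}]
openai_tools = to_openai_tools(example)
```

The JSON Schema describing the arguments passes through untouched, which is why one tool registry can serve both providers.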
What a Real Investigation Looks Like
    openosint ❯ investigate john.doe@example.com
    ⠸ Investigating...
    → generate_dorks john.doe@example.com
    ✓ Generated 10 dork URLs
    → search_email john.doe@example.com
    ✓ Found: spotify, wordpress, gravatar, office365
    → search_breach john.doe@example.com
    ✓ Found in 2 breaches: LinkedIn (2016), Adobe (2013)
    → search_paste john.doe@example.com
    ✗ No results
    ╭───────────────────── Report ─────────────────────╮
    │ ## Ambiguity Check                               │
    │ Single target identified - high confidence.      │
    │                                                  │
    │ ## Online Presence                               │
    │ Spotify · WordPress · Gravatar · Office365       │
    │                                                  │
    │ ## Data Breaches                                 │
    │ LinkedIn (2016) · Adobe (2013)                   │
    │                                                  │
    │ ## Conclusion                                    │
    │ Moderate footprint. Credential rotation          │
    │ advisable given breach exposure.                 │
    ╰──────────────────────────────────────────────────╯
    ✓ Report saved → reports/2025-05-08_john-doe.md
Everything in that report came from actual tool output. Nothing invented.
Get Started
    pip install openosint
    openosint config # interactive setup: picks provider, validates API key
    openosint investigate "john.doe@example.com"
Full source: github.com/OpenOSINT/OpenOSINT
MIT License. For authorized security research use only — read DISCLAIMER.md.
What I'd Do Differently
If I rebuilt this today:
Parallel tool execution. Right now, tools run sequentially. There's no reason search_email and search_breach can't run concurrently when the agent wants both. Adds complexity to the message threading but worth it for speed.
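A rough sketch of what that could look like, assuming the tool calls from one model turn have already been collected as plain dicts with `id`, `name`, and `input` keys. `run_tools_concurrently` is a hypothetical helper, not something OpenOSINT ships. Threads are a reasonable fit because the tools mostly wait on subprocesses and HTTP calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_tools_concurrently(
    calls: List[Dict[str, Any]],
    tool_map: Dict[str, Callable[..., str]],
    max_workers: int = 4,
) -> List[Dict[str, Any]]:
    """Run every requested tool in a thread pool and return tool_result blocks."""

    def run_one(call: Dict[str, Any]) -> Dict[str, Any]:
        try:
            content = str(tool_map[call["name"]](**call["input"]))
            is_error = False
        except Exception as exc:
            # One failing tool must not take the others down with it
            content = f"Tool failed: {exc}"
            is_error = True
        return {
            "type": "tool_result",
            "tool_use_id": call["id"],  # ties the result to its request
            "content": content,
            "is_error": is_error,
        }

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whichever finishes first
        return list(pool.map(run_one, calls))
```

The message threading stays simple as long as all results for one turn go back together in a single user message, each carrying its own `tool_use_id`.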
Confidence scoring. The agent should annotate findings with a confidence level — "found via direct tool output" vs "inferred from dork results." Different epistemic weight.
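One possible shape for that annotation, sketched with hypothetical names (`Evidence`, `Finding`); nothing like this exists in the tool described here. The two levels are the ones named above.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    DIRECT = "found via direct tool output"
    INFERRED = "inferred from dork results"


@dataclass(frozen=True)
class Finding:
    claim: str          # what the report asserts
    source_tool: str    # which tool the claim traces back to
    evidence: Evidence  # how strongly the tool output supports it

    def render(self) -> str:
        """Format the finding with its provenance for the final report."""
        return f"{self.claim} [{self.source_tool}; {self.evidence.value}]"


finding = Finding("LinkedIn (2016)", "search_breach", Evidence.DIRECT)
line = finding.render()
```

Keeping the provenance next to each claim lets a reader weigh a breach hit differently from a username guessed out of search results.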
Streaming output. The Rich terminal renders the full report at the end. It should stream token-by-token so large reports feel instant.
Originally published on HackerNoon.



