Chudi Nnorukam

Posted on • Edited on • Originally published at chudi.dev

I Built an AI-Powered Bug Bounty System: Here's Everything That Happened



I ran my first automated vulnerability scan three months ago. Found 47 "critical" vulnerabilities. Submitted 12 reports.

Every single one was a false positive.

That specific embarrassment--of confidently submitting garbage to a program that now knows my name--still stings when I think about it. Traditional scanners generate noise. They don't think. They pattern match and hope something sticks.

Building AI-powered bug bounty automation requires a multi-agent architecture where specialized agents handle reconnaissance, testing, validation, and reporting independently. The key innovation isn't the automation itself--it's evidence-gated progression, where a finding must reach 0.85+ confidence through validated proof-of-concept execution before it ever reaches human review. That gate prevents the false-positive flood that destroys a researcher's reputation.


Why Did I Choose Multi-Agent Over Monolithic Scanners?

Monolithic scanners are brittle. Hit a rate limit? Everything stops. Encounter a CAPTCHA? Dead. One endpoint times out? The whole queue backs up.

I built something different--a 4-tier agent system where each agent operates independently:

Recon Agents run passive discovery in parallel. Subdomain enumeration via certificate transparency. Technology fingerprinting with httpx. JavaScript analysis for hidden endpoints. GraphQL introspection. They feed assets into a shared database but never block each other. The vulnerability classes they surface map to the OWASP Top Ten.

Testing Agents take those assets and probe for vulnerabilities. IDOR testing with multi-account replay. XSS payload injection. SQL injection patterns. SSRF with metadata service probing. Maximum 4 concurrent agents to avoid rate limiting--but each recovers independently if throttled. Testing follows the methodology outlined in the OWASP Web Security Testing Guide.
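That concurrency cap can be sketched as a small worker pool. A minimal sketch, assuming a `runWithLimit` helper (the name and shape are illustrative, not the actual codebase); a throttled task records its error and frees its slot instead of blocking the queue:

```typescript
// Cap concurrent testing tasks (e.g. at 4) so the target's rate limits
// aren't tripped; each task fails or recovers independently.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<Array<T | Error>> {
  const results: Array<T | Error> = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      try {
        results[i] = await tasks[i]();
      } catch (err) {
        // Record the failure; the worker moves on to the next task.
        results[i] = err instanceof Error ? err : new Error(String(err));
      }
    }
  }

  // Spawn at most `limit` workers that drain the shared queue.
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

The point of the pattern is the independence the article describes: one worker hitting a timeout or throttle doesn't stall the others.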

Validation Agent is the gatekeeper. Every finding goes through proof-of-concept execution before advancing. This is where I learned the hard lesson: detection is not exploitation. More on this in part 2 of this series.

Reporter Agent generates platform-specific reports only for validated findings. CVSS scoring, reproducible PoC code, evidence attachments. Vulnerability types are classified using MITRE CWE identifiers. Different formatters for HackerOne, Intigriti, Bugcrowd--covered in part 4.

Without me realizing it, I was building a system that mirrors how good human researchers actually work--parallel reconnaissance, focused testing, ruthless validation, careful reporting.

> [!TIP]
> The orchestrator (Claude Opus 4.6) coordinates all agents but doesn't do the work itself. It distributes tasks, manages budgets, detects failures, and persists session state. Like a project manager who never touches code.


How Does Evidence-Gated Progression Actually Work?

Every finding has a confidence score from 0.0 to 1.0. But here's what makes this different from typical severity ratings--the score changes as evidence accumulates.

A finding starts at maybe 0.3 when first detected. The Testing Agent found something that looks like reflected XSS. Could be real. Could be the payload appearing in an error message (harmless).

Then Validation runs. PoC execution in a sandboxed environment. Response diff analysis comparing baseline vs. vulnerable responses. False positive signature matching.

If the PoC executes successfully and the response actually demonstrates exploitation (not just reflection)--confidence jumps to 0.85+. Now it's queued for human review.

If PoC fails? Confidence drops. Maybe to 0.4. Still logged for weekly batch review, but not wasting human attention.

Finding Lifecycle:
Discovered (0.3) → Validating (varies) → Reviewed (human) → Submitted/Dismissed
                         ↑
              Confidence may increase or DECREASE

I originally thought validation would only increase confidence. In practice, validation is the adversary: it's trying to disprove your finding. Surviving that adversarial process is what makes a finding credible.
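A minimal sketch of that adversarial scoring, assuming a simple finding record and the 0.85 review threshold from this section (the outcome names and exact adjustments are illustrative):

```typescript
type ValidationOutcome = "poc_succeeded" | "poc_failed" | "fp_signature_match";

interface Finding {
  confidence: number; // 0.0 – 1.0
  status: "new" | "validating" | "reviewed";
}

// Validation can move confidence in either direction; only findings
// that survive it with 0.85+ are queued for human review.
function applyValidation(f: Finding, outcome: ValidationOutcome): Finding {
  let confidence = f.confidence;
  switch (outcome) {
    case "poc_succeeded":
      confidence = Math.max(confidence, 0.85); // exploitation demonstrated
      break;
    case "poc_failed":
      confidence = Math.min(confidence, 0.4); // demoted to weekly batch review
      break;
    case "fp_signature_match":
      confidence = 0.1; // matches a known false-positive pattern
      break;
  }
  const status = confidence >= 0.85 ? "reviewed" : "validating";
  return { ...f, confidence, status };
}
```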


What Is SQLite RAG and Why Does It Matter?

RAG stands for Retrieval-Augmented Generation. But forget the buzzword--here's what it actually does:

The system remembers. Not just what vulnerabilities it found, but what worked, what failed, and why.

When I start testing a new target running Laravel, the system queries its knowledge base: "What techniques succeeded on Laravel targets before? What false positive patterns should I avoid?"

It retrieves relevant context and adjusts strategy accordingly.

The database has 13 tables but three matter most:

| Table | Purpose |
| --- | --- |
| knowledge_base | Semantic embeddings of past findings and techniques |
| false_positive_signatures | Known patterns that look like vulns but aren't |
| failure_patterns | Recovery strategies for different error types |

That last table connects directly to part 3 of this series--failure-driven learning is where the system actually gets smarter over time.
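The retrieval step behind those "what worked on Laravel before?" queries boils down to similarity ranking over stored embeddings. A sketch with cosine similarity (the `KnowledgeEntry` shape and the tiny vectors are illustrative; real embeddings come from an embedding model and live in the `knowledge_base` table):

```typescript
interface KnowledgeEntry {
  technique: string;
  embedding: number[]; // semantic embedding of a past finding/technique
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank the knowledge base against the current target's embedding and
// keep the k most relevant entries as retrieved context.
function retrieveTopK(
  query: number[],
  kb: KnowledgeEntry[],
  k: number
): KnowledgeEntry[] {
  return [...kb]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The retrieved entries are what get injected into the agent's prompt before it picks a testing strategy.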


How Does the Orchestrator Coordinate Everything?

Claude Opus 4.6 runs the show. But it's constrained by design.

The orchestrator handles:

  • Task distribution: Which agents work on which assets
  • Budget management: API call limits per platform, token usage tracking
  • Failure detection: When an agent hits errors, classify and recover
  • Session persistence: Checkpoint every 5 minutes for crash recovery

Here's the key pattern--the BaseAgent abstraction:

abstract class BaseAgent {
  protected config: AgentConfig;
  abstract execute(params: Record<string, unknown>): Promise<AgentResult>;
  protected async withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> { /* ... */ }
}

Every specialized agent inherits this contract. Testing agents, recon agents, validation--all share consistent timeout handling, error propagation, and rate limit enforcement.
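A runnable toy version of that contract, with `AgentConfig`, `AgentResult`, a `Promise.race`-based timeout, and the echo agent all being assumptions for illustration:

```typescript
interface AgentConfig { timeoutMs: number; }           // assumed shape
interface AgentResult { ok: boolean; data?: unknown; } // assumed shape

abstract class BaseAgent {
  constructor(protected config: AgentConfig) {}

  abstract execute(params: Record<string, unknown>): Promise<AgentResult>;

  // Shared timeout handling every agent inherits: race the work
  // against a rejection timer, then clean the timer up.
  protected withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
  }
}

// Toy agent: echoes its params back, wrapped in the shared timeout.
class EchoReconAgent extends BaseAgent {
  async execute(params: Record<string, unknown>): Promise<AgentResult> {
    const work = Promise.resolve({ ok: true, data: params });
    return this.withTimeout(work, this.config.timeoutMs);
  }
}
```

Because the timeout lives on the base class, no subclass can invent its own (possibly infinite) retry-and-wait behavior.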

I love constraints like this. They prevent the AI from getting creative in ways that break things.

> [!WARNING]
> Without the BaseAgent contract, each agent would handle errors differently. Some might retry infinitely. Some might swallow errors silently. The abstraction enforces consistency across a system that could easily become chaotic.


How Do You Resume After a Crash?

Session persistence was born from pain.

I was 6 hours into scanning a large program. Found 3 promising leads. System crashed because my laptop went to sleep.

Lost everything.

Now the system saves state every 5 minutes:

const checkpoint = {
  sessionId, timestamp, phase, progress,
  discoveredAssets: [...],
  findings: [...]
};
db.insert('context_snapshots', checkpoint);

Resume with `pnpm run start --resume session-id` and you're back exactly where you left off.

The database persists everything: programs, assets, findings, sessions, failure logs. Even if the application crashes, the SQLite file survives.
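The save/resume round trip can be sketched with a JSON file standing in for the `context_snapshots` table (field names mirror the checkpoint above; the file path and function names are illustrative):

```typescript
import { writeFileSync, readFileSync } from "node:fs";

interface Checkpoint {
  sessionId: string;
  timestamp: number;
  phase: string;            // e.g. "recon", "testing"
  progress: number;         // 0.0 – 1.0
  discoveredAssets: string[];
  findings: unknown[];
}

// Called every 5 minutes: serialize the full session state.
function saveCheckpoint(path: string, cp: Checkpoint): void {
  writeFileSync(path, JSON.stringify(cp));
}

// Called on --resume: reload the last snapshot and continue
// from its recorded phase and progress.
function resumeSession(path: string): Checkpoint {
  return JSON.parse(readFileSync(path, "utf8")) as Checkpoint;
}
```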


What's the Real Code Pattern?

The finding lifecycle is a state machine:

States:
- new (just detected)
- validating (PoC execution)
- reviewed (human decision)
- submitted (sent to platform)
- dismissed (false positive)

Transitions:
new → validating (automatic)
validating → validating (confidence adjustment)
validating → reviewed (0.70+ confidence)
reviewed → submitted (human approval)
reviewed → dismissed (human rejection)

Confidence isn't binary. A finding can bounce between states, gaining or losing credibility based on evidence.

I hated state machines in CS classes. But I need them here. The complexity they handle--partial validation, human-in-the-loop gates, platform-specific submission--would be chaos without them.
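One way to make those transitions explicit is a lookup table, so an illegal move throws instead of silently corrupting a finding. A sketch (the guard function is illustrative, not the project's code):

```typescript
type FindingState = "new" | "validating" | "reviewed" | "submitted" | "dismissed";

// Legal transitions from the lifecycle above.
const TRANSITIONS: Record<FindingState, FindingState[]> = {
  new: ["validating"],                    // automatic
  validating: ["validating", "reviewed"], // re-validation adjusts confidence
  reviewed: ["submitted", "dismissed"],   // human decision
  submitted: [],                          // terminal
  dismissed: [],                          // terminal
};

function transition(from: FindingState, to: FindingState): FindingState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

The table doubles as documentation: anyone reading it can see that nothing reaches `submitted` without passing through the human `reviewed` gate.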


Where Does This Series Go Next?

This is part 1 of a 5-part series on building bug bounty automation:

  1. Architecture & Multi-Agent Design (you are here)
  2. From Detection to Proof: Validation & False Positives
  3. Failure-Driven Learning: Auto-Recovery Patterns
  4. One Tool, Three Platforms: Multi-Platform Integration
  5. Human-in-the-Loop: The Ethics of Security Automation

Next up: why response diff analysis beats payload detection, and how the validation agent reduced my false positive rate from "embarrassing" to "acceptable."

What the First Month Actually Looks Like

Month one is calibration, not hunting.

The system doesn't know your programs yet. The RAG database is empty. Every finding is evaluated without prior context, which means the false positive rate is higher than steady state. Plan for this.

Here's what my first month actually looked like:

Week 1: 22 findings queued. Reviewed all manually. Approved 14, rejected 8. Submitted 12 of the 14 (two were out of scope due to timing). Accepted: 7. Informational/NA: 5.

Week 2: The RAG database starts learning. Three of last week's rejected findings were variations of the same pattern--a debug endpoint returning verbose errors that looks exploitable but isn't. The system logged this as a false positive signature. In week two, it filtered 6 similar findings automatically before they reached the queue.

Week 3: Confidence scoring starts becoming meaningful. Findings I would have rejected on inspection now fail validation before they reach me. 18 findings queued, 15 approved. The ratio is improving.

Week 4: 16 queued, 15 approved, 13 submitted. The RAG database now has enough signal to filter program-specific patterns--things that are valid findings on other programs but not on this specific target due to its technology stack or security posture.

The calibration month isn't glamorous. You're mostly teaching the system what doesn't count. But without it, month two is chaos. The upfront investment in manual rejection--with detailed notes about why you're rejecting--is what makes months three and four productive.

Most builders skip the calibration period because they want results immediately. They get results: all the wrong ones. Patience with month one compounds into efficiency from month three onward.

The calibration investment is also what makes the system genuinely trustworthy over time. After a month of labeled rejections and acceptances, the confidence scores mean something specific to your programs and your testing patterns. A 0.85 finding has a track record behind it. Without calibration, 0.85 is just an arbitrary number someone chose to put in the config file.


Maybe the goal isn't to automate bug bounty hunting. Maybe it's to automate the parts that don't require judgment--so human attention goes where it actually matters.
