Chudi Nnorukam

Posted on • Edited on • Originally published at chudi.dev

I Built an AI-Powered Bug Bounty System: Here's Everything That Happened



I ran my first automated vulnerability scan three months ago. Found 47 "critical" vulnerabilities. Submitted 12 reports.

Every single one was a false positive.

That specific embarrassment--of confidently submitting garbage to a program that now knows my name--still stings when I think about it. Traditional scanners generate noise. They don't think. They pattern match and hope something sticks.

Building AI-powered bug bounty automation requires a multi-agent architecture where specialized agents handle reconnaissance, testing, validation, and reporting independently. The key innovation isn't the automation itself--it's evidence-gated progression, where a finding must reach 0.85+ confidence through validated proof-of-concept execution before it ever reaches human review. That gate prevents the false-positive flood that destroys a researcher's reputation.


Why Did I Choose Multi-Agent Over Monolithic Scanners?

Monolithic scanners are brittle. Hit a rate limit? Everything stops. Encounter a CAPTCHA? Dead. One endpoint times out? The whole queue backs up.

I built something different--a 4-tier agent system where each agent operates independently:

Recon Agents run passive discovery in parallel. Subdomain enumeration via certificate transparency. Technology fingerprinting with httpx. JavaScript analysis for hidden endpoints. GraphQL introspection. They feed assets into a shared database but never block each other. The vulnerability classes they surface map to the OWASP Top Ten.

Testing Agents take those assets and probe for vulnerabilities. IDOR testing with multi-account replay. XSS payload injection. SQL injection patterns. SSRF with metadata service probing. Maximum 4 concurrent agents to avoid rate limiting--but each recovers independently if throttled. Testing follows the methodology outlined in the OWASP Web Security Testing Guide.
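That concurrency cap can be sketched as a small worker pool. A minimal sketch, assuming a `runWithLimit` helper (the name and shape are illustrative, not the actual codebase); a throttled task records its error and frees its slot instead of blocking the queue:

```typescript
// Cap concurrent testing tasks (e.g. at 4) so the target's rate limits
// aren't tripped; each task fails or recovers independently.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<Array<T | Error>> {
  const results: Array<T | Error> = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      try {
        results[i] = await tasks[i]();
      } catch (err) {
        // Record the failure; the worker moves on to the next task.
        results[i] = err instanceof Error ? err : new Error(String(err));
      }
    }
  }

  // Spawn at most `limit` workers that drain the shared queue.
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

The point of the pattern is the independence the article describes: one worker hitting a timeout or throttle doesn't stall the others.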

Validation Agent is the gatekeeper. Every finding goes through proof-of-concept execution before advancing. This is where I learned the hard lesson: detection is not exploitation. More on this in part 2 of this series.

Reporter Agent generates platform-specific reports only for validated findings. CVSS scoring, reproducible PoC code, evidence attachments. Vulnerability types are classified using MITRE CWE identifiers. Different formatters for HackerOne, Intigriti, Bugcrowd--covered in part 4.

Without me realizing it, I was building a system that mirrors how good human researchers actually work--parallel reconnaissance, focused testing, ruthless validation, careful reporting.

> [!TIP]
> The orchestrator (Claude Opus 4.6) coordinates all agents but doesn't do the work itself. It distributes tasks, manages budgets, detects failures, and persists session state. Like a project manager who never touches code.


How Does Evidence-Gated Progression Actually Work?

Every finding has a confidence score from 0.0 to 1.0. But here's what makes this different from typical severity ratings--the score changes as evidence accumulates.

A finding starts at maybe 0.3 when first detected. The Testing Agent found something that looks like reflected XSS. Could be real. Could be the payload appearing in an error message (harmless).

Then Validation runs. PoC execution in a sandboxed environment. Response diff analysis comparing baseline vs. vulnerable responses. False positive signature matching.

If the PoC executes successfully and the response actually demonstrates exploitation (not just reflection)--confidence jumps to 0.85+. Now it's queued for human review.

If PoC fails? Confidence drops. Maybe to 0.4. Still logged for weekly batch review, but not wasting human attention.

Finding Lifecycle:
Discovered (0.3) → Validating (varies) → Reviewed (human) → Submitted/Dismissed
                         ↑
              Confidence may increase or DECREASE

I originally thought validation would only increase confidence. In practice, validation is the adversary: it's trying to disprove your finding. Surviving that adversarial process is what makes a finding credible.
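A minimal sketch of that adversarial scoring, assuming a simple finding record and the 0.85 review threshold from this section (the outcome names and exact adjustments are illustrative):

```typescript
type ValidationOutcome = "poc_succeeded" | "poc_failed" | "fp_signature_match";

interface Finding {
  confidence: number; // 0.0 – 1.0
  status: "new" | "validating" | "reviewed";
}

// Validation can move confidence in either direction; only findings
// that survive it with 0.85+ are queued for human review.
function applyValidation(f: Finding, outcome: ValidationOutcome): Finding {
  let confidence = f.confidence;
  switch (outcome) {
    case "poc_succeeded":
      confidence = Math.max(confidence, 0.85); // exploitation demonstrated
      break;
    case "poc_failed":
      confidence = Math.min(confidence, 0.4); // demoted to weekly batch review
      break;
    case "fp_signature_match":
      confidence = 0.1; // matches a known false-positive pattern
      break;
  }
  const status = confidence >= 0.85 ? "reviewed" : "validating";
  return { ...f, confidence, status };
}
```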


What Is SQLite RAG and Why Does It Matter?

RAG stands for Retrieval-Augmented Generation. But forget the buzzword--here's what it actually does:

The system remembers. Not just what vulnerabilities it found, but what worked, what failed, and why.

When I start testing a new target running Laravel, the system queries its knowledge base: "What techniques succeeded on Laravel targets before? What false positive patterns should I avoid?"

It retrieves relevant context and adjusts strategy accordingly.

The database has 13 tables but three matter most:

| Table | Purpose |
| --- | --- |
| knowledge_base | Semantic embeddings of past findings and techniques |
| false_positive_signatures | Known patterns that look like vulns but aren't |
| failure_patterns | Recovery strategies for different error types |

That last table connects directly to part 3 of this series--failure-driven learning is where the system actually gets smarter over time.
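The retrieval step behind those "what worked on Laravel before?" queries boils down to similarity ranking over stored embeddings. A sketch with cosine similarity (the `KnowledgeEntry` shape and the tiny vectors are illustrative; real embeddings come from an embedding model and live in the `knowledge_base` table):

```typescript
interface KnowledgeEntry {
  technique: string;
  embedding: number[]; // semantic embedding of a past finding/technique
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank the knowledge base against the current target's embedding and
// keep the k most relevant entries as retrieved context.
function retrieveTopK(
  query: number[],
  kb: KnowledgeEntry[],
  k: number
): KnowledgeEntry[] {
  return [...kb]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The retrieved entries are what get injected into the agent's prompt before it picks a testing strategy.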


How Does the Orchestrator Coordinate Everything?

Claude Opus 4.6 runs the show. But it's constrained by design.

The orchestrator handles:

  • Task distribution: Which agents work on which assets
  • Budget management: API call limits per platform, token usage tracking
  • Failure detection: When an agent hits errors, classify and recover
  • Session persistence: Checkpoint every 5 minutes for crash recovery

Here's the key pattern--the BaseAgent abstraction:

abstract class BaseAgent {
  protected config: AgentConfig;
  abstract execute(params: Record<string, unknown>): Promise<AgentResult>;
  protected async withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> { /* ... */ }
}

Every specialized agent inherits this contract. Testing agents, recon agents, validation--all share consistent timeout handling, error propagation, and rate limit enforcement.
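A runnable toy version of that contract, with `AgentConfig`, `AgentResult`, a `Promise.race`-based timeout, and the echo agent all being assumptions for illustration:

```typescript
interface AgentConfig { timeoutMs: number; }           // assumed shape
interface AgentResult { ok: boolean; data?: unknown; } // assumed shape

abstract class BaseAgent {
  constructor(protected config: AgentConfig) {}

  abstract execute(params: Record<string, unknown>): Promise<AgentResult>;

  // Shared timeout handling every agent inherits: race the work
  // against a rejection timer, then clean the timer up.
  protected withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
  }
}

// Toy agent: echoes its params back, wrapped in the shared timeout.
class EchoReconAgent extends BaseAgent {
  async execute(params: Record<string, unknown>): Promise<AgentResult> {
    const work = Promise.resolve({ ok: true, data: params });
    return this.withTimeout(work, this.config.timeoutMs);
  }
}
```

Because the timeout lives on the base class, no subclass can invent its own (possibly infinite) retry-and-wait behavior.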

I love constraints like this. They prevent the AI from getting creative in ways that break things.

> [!WARNING]
> Without the BaseAgent contract, each agent would handle errors differently. Some might retry infinitely. Some might swallow errors silently. The abstraction enforces consistency across a system that could easily become chaotic.


How Do You Resume After a Crash?

Session persistence was born from pain.

I was 6 hours into scanning a large program. Found 3 promising leads. System crashed because my laptop went to sleep.

Lost everything.

Now the system saves state every 5 minutes:

const checkpoint = {
  sessionId, timestamp, phase, progress,
  discoveredAssets: [...],
  findings: [...]
};
db.insert('context_snapshots', checkpoint);

Resume with `pnpm run start --resume session-id` and you're back exactly where you left off.

The database persists everything: programs, assets, findings, sessions, failure logs. Even if the application crashes, the SQLite file survives.
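The save/resume round trip can be sketched with a JSON file standing in for the `context_snapshots` table (field names mirror the checkpoint above; the file path and function names are illustrative):

```typescript
import { writeFileSync, readFileSync } from "node:fs";

interface Checkpoint {
  sessionId: string;
  timestamp: number;
  phase: string;            // e.g. "recon", "testing"
  progress: number;         // 0.0 – 1.0
  discoveredAssets: string[];
  findings: unknown[];
}

// Called every 5 minutes: serialize the full session state.
function saveCheckpoint(path: string, cp: Checkpoint): void {
  writeFileSync(path, JSON.stringify(cp));
}

// Called on --resume: reload the last snapshot and continue
// from its recorded phase and progress.
function resumeSession(path: string): Checkpoint {
  return JSON.parse(readFileSync(path, "utf8")) as Checkpoint;
}
```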


What's the Real Code Pattern?

The finding lifecycle is a state machine:

States:
- new (just detected)
- validating (PoC execution)
- reviewed (human decision)
- submitted (sent to platform)
- dismissed (false positive)

Transitions:
new → validating (automatic)
validating → validating (confidence adjustment)
validating → reviewed (0.70+ confidence)
reviewed → submitted (human approval)
reviewed → dismissed (human rejection)

Confidence isn't binary. A finding can bounce between states, gaining or losing credibility based on evidence.

I hated state machines in CS classes. But I need them here. The complexity they handle--partial validation, human-in-the-loop gates, platform-specific submission--would be chaos without them.
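One way to make those transitions explicit is a lookup table, so an illegal move throws instead of silently corrupting a finding. A sketch (the guard function is illustrative, not the project's code):

```typescript
type FindingState = "new" | "validating" | "reviewed" | "submitted" | "dismissed";

// Legal transitions from the lifecycle above.
const TRANSITIONS: Record<FindingState, FindingState[]> = {
  new: ["validating"],                    // automatic
  validating: ["validating", "reviewed"], // re-validation adjusts confidence
  reviewed: ["submitted", "dismissed"],   // human decision
  submitted: [],                          // terminal
  dismissed: [],                          // terminal
};

function transition(from: FindingState, to: FindingState): FindingState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

The table doubles as documentation: anyone reading it can see that nothing reaches `submitted` without passing through the human `reviewed` gate.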


Where Does This Series Go Next?

This is part 1 of a 5-part series on building bug bounty automation:

  1. Architecture & Multi-Agent Design (you are here)
  2. From Detection to Proof: Validation & False Positives
  3. Failure-Driven Learning: Auto-Recovery Patterns
  4. One Tool, Three Platforms: Multi-Platform Integration
  5. Human-in-the-Loop: The Ethics of Security Automation

Next up: why response diff analysis beats payload detection, and how the validation agent reduced my false positive rate from "embarrassing" to "acceptable."

What the First Month Actually Looks Like

Month one is calibration, not hunting.

The system doesn't know your programs yet. The RAG database is empty. Every finding is evaluated without prior context, which means the false positive rate is higher than steady state. Plan for this.

Here's what my first month actually looked like:

Week 1: 22 findings queued. Reviewed all manually. Approved 14, rejected 8. Submitted 12 of the 14 (two were out of scope due to timing). Accepted: 7. Informational/NA: 5.

Week 2: The RAG database starts learning. Three of last week's rejected findings were variations of the same pattern--a debug endpoint returning verbose errors that looks exploitable but isn't. The system logged this as a false positive signature. In week two, it filtered 6 similar findings automatically before they reached the queue.

Week 3: Confidence scoring starts becoming meaningful. Findings I would have rejected on inspection now fail validation before they reach me. 18 findings queued, 15 approved. The ratio is improving.

Week 4: 16 queued, 15 approved, 13 submitted. The RAG database now has enough signal to filter program-specific patterns--things that are valid findings on other programs but not on this specific target due to its technology stack or security posture.

The calibration month isn't glamorous. You're mostly teaching the system what doesn't count. But without it, month two is chaos. The upfront investment in manual rejection--with detailed notes about why you're rejecting--is what makes months three and four productive.

Most builders skip the calibration period because they want results immediately. They get results: all the wrong ones. Patience with month one compounds into efficiency from month three onward.

The calibration investment is also what makes the system genuinely trustworthy over time. After a month of labeled rejections and acceptances, the confidence scores mean something specific to your programs and your testing patterns. A 0.85 finding has a track record behind it. Without calibration, 0.85 is just an arbitrary number someone chose to put in the config file.


Maybe the goal isn't to automate bug bounty hunting. Maybe it's to automate the parts that don't require judgment--so human attention goes where it actually matters.
