I spent the better part of two years manually scrolling through GitHub issues, reading through thousand-line diff reports, and cross-referencing bug bounty programs. By October 2023, I’d logged exactly 47 hours of unpaid triage work for a return of $320. That math stopped working for me. I needed a system that could read faster, filter harder, and actually sleep.
So I built an automated pipeline. Not a magic button, just a set of tightly scoped scripts that hand off work to lightweight AI agents at specific checkpoints. The goal wasn’t to replace human analysis. It was to remove the repetitive scanning that makes most bounty hunters burn out.
The architecture is straightforward. A Python scheduler pulls repositories using the Search API, filtering by language, activity level, and bounty program tags. Each repo gets cloned to a temporary volume. A static analysis pass runs against the code, flagging potential issue categories. Those raw findings get fed to a local LLM agent configured with strict JSON output and a vulnerability taxonomy. The agent validates, scores, and formats the results. If a finding crosses a confidence threshold of 0.75, it gets routed to a notification queue. I review the queue, reproduce locally if needed, and submit.
Here’s the core scanner module. It handles the API pagination, clones quietly, runs a basic pattern match, and hands off to the agent layer.
import os
import json
import subprocess
import re
import requests
from pathlib import Path
GITHUB_TOKEN = os.environ["GH_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}", "Accept": "application/vnd.github+json"}
SCANNER_PATTERNS = [r"eval\(", r"subprocess\.call\(", r"os\.system\(", r"requests\.get\(.*verify=False"]
def fetch_repos(query: str, max_pages: int = 5) -> list[str]:
    """Page through the GitHub search API and collect repository URLs."""
    repos = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers=HEADERS,
            params={"q": query, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            # No more results; stop paginating early.
            break
        repos.extend([item["html_url"] for item in items])
    return repos
def scan_local_repo(repo_path: Path) -> list[str]:
    """Grep Python files for known antipatterns and return file:line hits."""
    findings = []
    for py_file in repo_path.rglob("*.py"):
        # Skip symlinks and any path that mentions "test" to cut obvious noise.
        if py_file.is_symlink() or "test" in str(py_file):
            continue
        content = py_file.read_text(encoding="utf-8", errors="ignore")
        for line_num, line in enumerate(content.splitlines(), start=1):
            for pattern in SCANNER_PATTERNS:
                if re.search(pattern, line, re.IGNORECASE):
                    findings.append(f"{py_file.relative_to(repo_path)}:{line_num} - {pattern}")
    return findings
if __name__ == "__main__":
    targets = fetch_repos("is:public language:python updated:>2023-01-01")
    print(f"Found {len(targets)} repositories. Starting scan...")
    for url in targets:
        name = url.split("/")[-1]
        dest = Path("/tmp/bounty_scan") / name
        # Shallow clone to a temporary volume; these get wiped on a schedule.
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
        results = scan_local_repo(dest)
        if results:
            print(json.dumps({"repo": name, "findings": results, "count": len(results)}))
That script ran against 1,240 repositories across 14 days. It generated 4,812 raw pattern matches. Without filtering, that volume is useless. I fed the results to an agent prompt that enforced three rules: ignore test directories, flag only functions reachable from external inputs, and output a strict JSON schema. The false positive rate dropped from 89% to 11%. I spent about 6 hours reviewing the filtered list instead of 47.
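For anyone curious what that agent hop looks like, here is a stripped-down sketch. It assumes the local model sits behind an OpenAI-compatible chat completions endpoint; the endpoint URL, the model name, and the filter_findings helper are placeholders, not the exact code in my pipeline.

import json
import requests

AGENT_RULES = (
    "You are a vulnerability triage agent. Ignore anything under a test directory, "
    "keep only sinks reachable from external input, and reply with JSON only: "
    '{"findings": [{"file": ..., "line": ..., "category": ..., "confidence": ...}]}'
)

def filter_findings(raw_findings: list[str],
                    endpoint: str = "http://localhost:8080/v1/chat/completions") -> list[dict]:
    """Send raw pattern matches to a local LLM and keep the high-confidence ones."""
    resp = requests.post(
        endpoint,
        json={
            "model": "local-triage",          # placeholder model name
            "temperature": 0.1,               # keep runs as repeatable as possible
            "response_format": {"type": "json_object"},
            "messages": [
                {"role": "system", "content": AGENT_RULES},
                {"role": "user", "content": json.dumps({"findings": raw_findings})},
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    scored = json.loads(resp.json()["choices"][0]["message"]["content"])
    # Only findings at or above the 0.75 confidence threshold reach the review queue.
    return [f for f in scored.get("findings", []) if f.get("confidence", 0) >= 0.75]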
Out of that filtered set, 14 items survived local reproduction. I submitted them to three different bug bounty programs. The payouts totaled $8,450. The average triage time from discovery to submission was 2 days and 14 hours. GitHub rate limits forced me to stagger requests across four tokens, which added roughly 3 hours of idle time to the overall pipeline.
There are boundaries you have to respect. GitHub’s Terms of Service restrict automated mass scraping. I kept my requests to 120 per hour, used conditional headers for cache validation, and only targeted public repositories with clear disclosure policies. I added a 4-second delay between API calls and never stored source code longer than 48 hours. The temporary volumes get wiped automatically.
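In code, that discipline is just a thin wrapper around every API call plus a cleanup job. The sketch below is illustrative rather than lifted from my scanner: polite_get and wipe_stale_clones are made-up names, and it reuses the same HEADERS dict defined above for auth.

import shutil
import time
import requests
from pathlib import Path

_recent_calls: list[float] = []
_etag_cache: dict[str, str] = {}

def polite_get(url: str, **kwargs) -> requests.Response:
    """GET with a 4-second floor between calls and a 120-per-hour cap."""
    now = time.time()
    while sum(1 for t in _recent_calls if now - t < 3600) >= 120:
        time.sleep(30)                      # budget for the hour is spent; wait it out
        now = time.time()
    if _recent_calls and now - _recent_calls[-1] < 4:
        time.sleep(4 - (now - _recent_calls[-1]))
    headers = dict(HEADERS)                 # same auth headers as the scanner module
    if url in _etag_cache:
        headers["If-None-Match"] = _etag_cache[url]   # cache validation: unchanged pages come back as 304
    resp = requests.get(url, headers=headers, **kwargs)
    _recent_calls.append(time.time())
    if "ETag" in resp.headers:
        _etag_cache[url] = resp.headers["ETag"]
    return resp

def wipe_stale_clones(base: Path = Path("/tmp/bounty_scan"), max_age_hours: int = 48) -> None:
    """Delete any cloned repo older than the retention window."""
    if not base.exists():
        return
    cutoff = time.time() - max_age_hours * 3600
    for repo_dir in base.iterdir():
        if repo_dir.is_dir() and repo_dir.stat().st_mtime < cutoff:
            shutil.rmtree(repo_dir, ignore_errors=True)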
The AI agents don’t write exploits. They read static output and apply a fixed rule set. I built the prompt from 200 historical CVE descriptions, mapped them to common Python antipatterns, and locked the temperature at 0.1. The system executes the same logic every run, which makes debugging straightforward. When a false positive slipped through last month, I traced it to a missing regex boundary. I patched it, reran the batch, and verified the fix in 18 minutes.
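The rule set itself is nothing exotic. Here is an illustrative slice of the kind of mapping the prompt is built from; the category names and patterns are examples, not the full 200-entry taxonomy.

# Illustrative excerpt: CVE-derived categories mapped to the antipatterns the agent scores against.
# The \b word boundaries are the kind of fix that closed the false positive mentioned above.
VULN_TAXONOMY = {
    "code-injection": [r"\beval\(", r"\bexec\("],
    "command-injection": [r"\bos\.system\(", r"subprocess\.\w+\(.*shell=True"],
    "tls-verification-disabled": [r"verify\s*=\s*False"],
    "unsafe-deserialization": [r"\bpickle\.loads\(", r"yaml\.load\((?!.*Loader)"],
}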
If you want to build something similar, start small. Pick one language. Write one pattern set. Route everything to a local JSON file first. Add the LLM step only after you’ve proven the static scanner catches what you expect. You’ll need a budget for API calls, a decent CPU, and a notebook to track which programs pay. I use a simple SQLite table with columns for repo, finding, confidence, submission date, and payout. After 6 months, my average payout per submission sits at $603.57, and I’ve stopped chasing programs that take longer than 45 days to respond.
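The tracker really is just a few lines of stdlib sqlite3. A minimal version of that schema, with whatever file and table names you prefer, looks like this:

import sqlite3

def init_tracker(db_path: str = "bounty_tracker.db") -> sqlite3.Connection:
    """Create the submissions table if it doesn't exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS submissions (
            repo TEXT NOT NULL,
            finding TEXT NOT NULL,
            confidence REAL,
            submission_date TEXT,
            payout REAL
        )
        """
    )
    conn.commit()
    return conn

# The average-payout-per-submission number quoted above is one query away:
# SELECT AVG(payout) FROM submissions WHERE payout IS NOT NULL;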
Automation doesn’t replace the work. It just moves it from scrolling to reviewing. The system handles roughly 80% of the triage fatigue. The remaining 20% is still me, sitting at a desk, writing reproduction scripts and waiting for triage emails. That part won’t go away, and it shouldn’t.
Check your scope. Respect rate limits. Keep your prompts deterministic. The rest is just patience and better filters.