
Qiushi

Posted on • Originally published at claw-stack.com

Building a Tri-Modal Knowledge Engine for CTF Agents

When Librarian gets asked about tcache stashing, it needs to return something more useful than what a base Claude model knows. The model has a general understanding of heap exploitation — it can describe what tcache is, explain the concept of a stash unlink attack, gesture at the shape of an exploit. But it doesn't know the specific pwntools idiom your teammates used last week, or the exact GDB command that reveals the free list state in the version of libc pinned to the challenge binary. That gap — between general knowledge and actionable specifics — is what the knowledge engine exists to close.

Why "just ask the LLM" isn't enough

A base model's knowledge has two failure modes in CTF contexts.

The first is staleness. CTF challenges often involve recent CVEs, updated tool versions, or techniques documented only in writeups from the past year. A model with a training cutoff doesn't know these. The second is precision. Knowing that GTFOBins documents nmap privilege escalation techniques is not the same as having the exact --script=exec incantation ready to paste. In a time-limited competition, the difference between "the agent knows the theory" and "the agent has the exact command" can be the difference between a solve and a dead end.

There's also a context budget problem. Librarian (Claude Haiku) is called once per challenge and has a fixed context window. You can't embed all of HackTricks in the prompt. You need targeted retrieval: the three most relevant things for this specific challenge, delivered quickly, in a format the agent can act on immediately.

The tri-modal architecture

The knowledge base separates three fundamentally different kinds of retrieval into three separate stores.

Type A — Muscle Memory (SQLite + FTS5)

Type A is for commands you want to copy and paste. The database (ctf_knowledge.db) contains two tables.

binaries holds ~2,739 structured records from GTFOBins and LOLBAS — one row per binary per exploitation method, indexed by name, platform (linux/windows), and function (shell, sudo, suid, download, etc.). These come from the GTFOBins YAML files and a LOLBAS JSON export. The schema is intentionally rigid: name, platform, function, code, description. A query for nmap + sudo returns the exact command, not a description of what nmap can do.
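A minimal sketch of what that lookup path looks like, assuming the column names described above (name, platform, function, code, description) — the sample row and exact SQL are illustrative, not the project's actual build output:

```python
import sqlite3

# In-memory stand-in for ctf_knowledge.db with the rigid Type A schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE binaries (name TEXT, platform TEXT, function TEXT, "
    "code TEXT, description TEXT)"
)
# Illustrative row in the GTFOBins style (one row per binary per method).
conn.execute(
    "INSERT INTO binaries VALUES (?, ?, ?, ?, ?)",
    ("nmap", "linux", "sudo",
     "sudo nmap --interactive\n!sh",
     "Drops to a shell from nmap's interactive mode"),
)

def lookup(name, function):
    # Exact-match query: returns the paste-ready command, not a description.
    row = conn.execute(
        "SELECT code FROM binaries WHERE name = ? AND function = ?",
        (name, function),
    ).fetchone()
    return row[0] if row else None
```

The rigid schema is what makes this useful: `lookup("nmap", "sudo")` hands back a command string directly.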

tricks is a SQLite FTS5 full-text search table with ~4,155 records from HackTricks and PayloadsAllTheThings. The build pipeline walks all markdown files, extracts code blocks using regex, and records the surrounding header as context. PayloadsAllTheThings gets filtered: the Intruder, Wordlists, Files, and Images directories are skipped (those assets go to Type C instead), and code blocks longer than 20 lines are dropped — a deliberate choice to keep tricks copy-paste ready rather than turning the table into a script archive.

FTS5 queries work by AND-ing the search terms together (nmap AND sudo AND shell). When FTS5 fails — which happens with punctuation-heavy queries — the gateway falls back to a LIKE search on the first word.
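The AND-then-fallback behavior can be sketched as follows; the table layout and function name are assumptions based on the description above, not the gateway's actual code:

```python
import sqlite3

# FTS5 virtual table standing in for the tricks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tricks USING fts5(header, code)")
conn.execute(
    "INSERT INTO tricks VALUES (?, ?)",
    ("Sudo nmap shell escape", "sudo nmap --interactive"),
)

def search_tricks(query):
    terms = query.split()
    try:
        # FTS5 implicitly ANDs space-separated terms: "nmap sudo shell"
        # only matches rows containing all three tokens.
        rows = conn.execute(
            "SELECT code FROM tricks WHERE tricks MATCH ?",
            (" ".join(terms),),
        ).fetchall()
        if rows:
            return [r[0] for r in rows]
    except sqlite3.OperationalError:
        pass  # punctuation-heavy queries (e.g. "--interactive") break FTS5 syntax
    # Fallback: plain LIKE on the first term only.
    rows = conn.execute(
        "SELECT code FROM tricks WHERE code LIKE ?", (f"%{terms[0]}%",)
    ).fetchall()
    return [r[0] for r in rows]
```

A query like `nmap --interactive` trips the FTS5 syntax error and silently takes the LIKE path instead of returning nothing.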

Type B — Cortex (ChromaDB + BGE-M3)

Type B is for methodology, concepts, and writeups. It uses ChromaDB with BAAI/bge-m3 embeddings: 1024-dimensional vectors, normalized, with Metal (MPS) acceleration on Apple Silicon.

The build pipeline (crawl_type_b.py) is a web crawler that reads target URLs from markdown configuration files, spiders up to 500 pages per site, and ingests text into the methodology collection. Pages are chunked by double newline, capped at 1,500 characters per chunk, with chunks under 100 characters discarded. Raw HTML is cached locally so rebuilding the index after changing the embedding model doesn't require re-crawling.
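The chunking rules reduce to a few lines; this is a sketch of the behavior described above (split on blank lines, cap at 1,500 characters, drop fragments under 100), with the constants and function name as illustrative stand-ins:

```python
# Illustrative constants mirroring the limits described in the article.
MAX_CHARS = 1500
MIN_CHARS = 100

def chunk_page(text):
    chunks = []
    for para in text.split("\n\n"):   # chunk by double newline
        para = para.strip()
        # Hard-cap oversized paragraphs at MAX_CHARS each.
        while len(para) > MAX_CHARS:
            chunks.append(para[:MAX_CHARS])
            para = para[MAX_CHARS:]
        # Discard tiny fragments that would embed poorly.
        if len(para) >= MIN_CHARS:
            chunks.append(para)
    return chunks
```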

Currently indexed: 0xdf's blog (machine writeups, practical exploitation techniques) and the pwntools documentation. The ctf-wiki repository is cloned locally and available for ingestion but requires separate processing. One wrinkle: trafilatura — the primary extraction library — gets blocked by docs.pwntools.com, so the crawler falls back to urllib for that domain.

A note on the embedding model: we upgraded from a 384-dimension model to BGE-M3 (1024 dimensions) mid-build. ChromaDB doesn't support mixed-dimension collections, so the upgrade required dropping and rebuilding the entire database. The build script handles this automatically, but it means every embedding model upgrade is a full rebuild.

Type C — Arsenal (JSON index)

Type C is for local files: wordlists, web shells, and privilege escalation scripts. Rather than a database, it's a flat JSON index (asset_index.json) mapping names and tags to absolute file paths.

What's in the index: SecLists (password lists, directory wordlists, username lists, fuzzing payloads), PayloadsAllTheThings web shells (.php, .jsp, and others), and PEASS-ng pre-compiled binaries (linpeas.sh, winpeas.bat, winPEASany.exe). The rockyou.txt wordlist is pre-decompressed and ready to use. Each entry carries a category, a tag list, and an absolute path — so when Operator needs to run ffuf -w <path>, Librarian hands back the path, not an instruction to find the path.
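A plausible shape for the index and its lookup, based on the description above — the field names and file paths are hypothetical, not the actual contents of asset_index.json:

```python
import json

# Hypothetical asset_index.json entries; real paths will differ.
INDEX = json.loads("""
[
  {"name": "rockyou.txt",
   "category": "wordlist",
   "tags": ["passwords", "bruteforce"],
   "path": "/opt/knowledge/TypeC/SecLists/Passwords/rockyou.txt"},
  {"name": "linpeas.sh",
   "category": "privesc",
   "tags": ["linux", "enumeration"],
   "path": "/opt/knowledge/TypeC/PEASS-ng/linpeas.sh"}
]
""")

def find_assets(query, limit=5):
    # Substring match against name, category, and tags; return paths only,
    # so the caller can paste them straight into a command line.
    q = query.lower()
    hits = [a for a in INDEX
            if q in a["name"].lower()
            or q in a["category"]
            or any(q in t for t in a["tags"])]
    return [a["path"] for a in hits[:limit]]
```

No database engine involved: a flat list plus a substring filter is enough when the payoff is just an absolute path.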

Tools like pwntools and ROPgadget are explicitly excluded. These are environment tools that Operator invokes directly; they're TypeD — present in the execution environment but not indexed here. Type C is for files you transfer or reference, not binaries you run.

The LibrarianGateway

The gateway (librarian_gateway.py) is the single interface to all three types. Its job is to route queries and apply automatic fallback and enhancement logic.

TypeA query:
  FTS5 search binaries + tricks
  ├─ hit  → return payloads
  └─ miss → fallback: run TypeB semantic search, label as theory (no ready payload)

TypeB query:
  ChromaDB semantic search
  ├─ hit  → extract keywords (words >4 chars, first 3) → run TypeA lookup
  │          return theory + concrete examples
  └─ miss → nothing

TypeC query:
  JSON substring/tag filter (name, tags, category)
  → return up to 5 matches with absolute paths

The TypeA→TypeB fallback is the most useful path in practice. When Operator asks Librarian for a precise command that doesn't exist in the SQLite database, the gateway doesn't just return nothing — it says "no payload found, but here's the methodology," giving Operator enough theory to reconstruct the approach from scratch.

The TypeB→TypeA enhancement works in the opposite direction. After a semantic search returns methodology results, the gateway extracts keywords from the returned text and runs an FTS5 lookup to find concrete commands that illustrate the theory. This avoids the pattern where the agent understands the concept but has to guess the syntax.
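Condensed into code, the two paths look roughly like this; `type_a_search` and `type_b_search` stand in for the real FTS5 and ChromaDB calls, and the return shapes are assumptions rather than the gateway's actual API:

```python
def extract_keywords(text, min_len=5, count=3):
    # The crude heuristic the article describes: words >4 chars, first 3.
    words = [w for w in text.split() if len(w) >= min_len]
    return words[:count]

def query_type_a(query, type_a_search, type_b_search):
    payloads = type_a_search(query)
    if payloads:
        return {"kind": "payload", "results": payloads}
    # Fallback: no ready payload, so return methodology labelled as theory.
    return {"kind": "theory", "results": type_b_search(query)}

def query_type_b(query, type_a_search, type_b_search):
    theory = type_b_search(query)
    if not theory:
        return {"kind": "none", "results": []}
    # Enhancement: mine keywords from the theory text, then look up
    # concrete commands that illustrate it.
    keywords = extract_keywords(" ".join(theory))
    examples = type_a_search(" ".join(keywords))
    return {"kind": "theory+examples", "results": theory, "examples": examples}
```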

The keyword extraction is crude: take words longer than four characters, pick the first three, run FTS5. It works often enough to be useful, but it misses short, domain-specific terms like "ROP", "XSS", and "SQL", as well as binary names like nc. This is the part of the system most in need of improvement.

The build pipeline

Building from scratch:

# Type A: parse GTFOBins YAML + LOLBAS JSON + HackTricks MD + PayloadsAllTheThings MD
python3 TypeA/build_db.py

# Type B: crawl configured sites, embed with BGE-M3, store in ChromaDB
python3 TypeB/crawl_type_b.py

# Type C: walk SecLists / PayloadsAllTheThings / PEASS-ng, emit JSON index
python3 TypeC/build_asset_index.py

Type A builds in seconds. Type B is the slow one: BGE-M3 runs inference for every chunk, and crawling a 500-page site with 0.5s politeness delays takes a while. The raw HTML cache means that once crawled, rebuilding the vector index from cache is much faster than re-crawling.

One dependency constraint worth noting: Python 3.14 breaks ChromaDB 1.5.1 due to a Pydantic compatibility issue. The project requires Python 3.10–3.13.

What worked at BearcatCTF

The clearest signal was category accumulation. By the time Operator reached the eighth cryptography challenge, Librarian had enough indexed context — from prior solves and its own sources — that its briefing was materially better calibrated than what it gave for the first challenge. Forensics showed the same pattern: binwalk and foremost appeared in early Librarian responses, and by challenge three, Operator was starting with the right tools rather than discovering them mid-attempt.

Type C was effective for web challenges. When Operator needed to upload a reverse shell or fuzz a directory, Librarian returned absolute paths rather than instructions to find the files. The friction reduction there is small but real in a timed context.

The architecture's weak point was pwn. Type B's coverage of heap exploitation methodology is reasonable — 0xdf's writeups cover it well — but Type A's coverage of specific pwntools invocations is thin. Most GTFOBins entries are for privilege escalation, not binary exploitation. Operator had to reconstruct pwntools boilerplate from the docs rather than retrieving it from an indexed source.

What we would change

Improve Type B coverage for pwn. The 0xdf blog and pwntools docs are the current sources. CTF-wiki is cloned locally but not yet ingested. Adding it, along with targeted crawls of well-known pwn writeup archives, would improve coverage for the challenge categories where theory-to-payload translation matters most.

Fix keyword extraction. The current heuristic (words >4 chars, first 3) was a placeholder that never got replaced. A minimal improvement would be to extract known CTF keywords — CVE numbers, binary names, technique names — before falling back to length heuristics.
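One possible version of that improvement, sketched under the assumption of a hand-curated term list (the list and regex here are illustrative):

```python
import re

# Illustrative allowlist of short, domain-specific CTF terms the length
# heuristic would miss.
KNOWN_TERMS = {"rop", "xss", "sql", "nc", "suid", "tcache", "gdb"}
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def extract_keywords(text, count=3):
    # Pass 1: CVE identifiers are always worth keeping.
    found = CVE_RE.findall(text)
    # Pass 2: known short terms, matched case-insensitively.
    for word in re.findall(r"[A-Za-z0-9-]+", text):
        if word.lower() in KNOWN_TERMS and word not in found:
            found.append(word)
    # Pass 3: fall back to the original length heuristic for leftover slots.
    if len(found) < count:
        found += [w for w in text.split()
                  if len(w) > 4 and w not in found][: count - len(found)]
    return found[:count]
```

The point is ordering: domain terms and CVE numbers win the three slots before generic long words do.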

Add TypeD integration hints. When Librarian returns a methodology result that implies a specific tool invocation (ROPgadget, pwntools, gdb-peda), it should note the tool and suggest the invocation pattern even if it's not in the index. Currently there's no connection between Type B theory results and the TypeD tools in the execution environment.

Cache invalidation for Type B. The raw HTML cache has no expiration. 0xdf's blog gets new writeups; pwntools docs update with new releases. The current approach requires manually deleting cached files to pick up changes. A TTL or content-hash check would fix this.
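The content-hash variant is small; this sketch assumes a `fetch()` callable and a cache keyed by URL hash, neither of which is the crawler's actual layout:

```python
import hashlib
import pathlib

def needs_reindex(url, cache_dir, fetch):
    # Cache file keyed by a hash of the URL (hypothetical layout).
    cached = pathlib.Path(cache_dir) / (
        hashlib.sha256(url.encode()).hexdigest() + ".html"
    )
    fresh = fetch(url)  # caller-supplied fetch returning raw bytes
    fresh_hash = hashlib.sha256(fresh).hexdigest()
    if cached.exists() and \
            hashlib.sha256(cached.read_bytes()).hexdigest() == fresh_hash:
        return False  # unchanged: keep the existing embeddings
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(fresh)
    return True  # changed or new: re-chunk and re-embed this page
```

This keeps the cheap property of the current cache (no re-crawl for rebuilds) while still picking up new writeups on the next scheduled pass.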

The engine in its current form is functional and was net-positive at BearcatCTF. It's also clearly a first version. The architecture is right — the three-way split between immediate payloads, methodology, and local assets maps cleanly onto how a human CTF player actually uses different reference materials. The rough edges are in the population and retrieval quality within each layer, not in the design.

This article was originally published on claw-stack.com. We're building an open-source AI agent runtime — check out the docs or GitHub.
