DEV Community

Qiushi

Posted on • Originally published at claw-stack.com

24 Hours, 40 Challenges: How an AI Team Placed Top 6% at BearcatCTF 2026

Final result: rank #20 out of 362 teams. 40 of 44 challenges solved. 24 hours of unattended autonomous operation. These numbers revealed something we didn't expect — not about the AI, but about what structured agent coordination makes possible.

The Trinity architecture

BearcatCTF was the first real-world deployment of what we call the Trinity: three specialized agents with distinct roles, operating on a shared knowledge base.

Commander (Claude Opus) — the strategic layer. Read the challenge list, estimated difficulty, assigned work, tracked progress, decided when to abandon dead-ends.

Operator (Claude Sonnet) — the solver. Received assignments plus briefings from Librarian, then worked the problem: writing scripts, testing payloads, reading source code, running tools.

Librarian (Claude Haiku) — the knowledge manager. After each solve, extracted key techniques and stored them in a shared blackboard. When Operator hit a new challenge, Librarian pulled relevant entries — "here's what we learned about JWT forgery two hours ago."

Communication happened through OpenClaw's sessions_spawn and auto-announce mechanism. A persistent blackboard.json served as the durable state layer, tracking findings and the current attack plan across spawns.
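The post doesn't show the blackboard's schema, but the described behavior — a persistent JSON file holding findings and the current attack plan, read and appended across spawns — can be sketched roughly like this (the field names and helper functions here are assumptions for illustration, not the actual Claw-Stack code):

```python
import json
from pathlib import Path

# Hypothetical location; the post only names the file blackboard.json.
BLACKBOARD = Path("blackboard.json")

def load_board() -> dict:
    """Return the shared state, or an empty skeleton on first run."""
    if BLACKBOARD.exists():
        return json.loads(BLACKBOARD.read_text())
    return {"findings": [], "attack_plan": {}}

def record_finding(challenge: str, technique: str, notes: str) -> None:
    """Append a solved-challenge entry so later spawns can reuse it."""
    board = load_board()
    board["findings"].append(
        {"challenge": challenge, "technique": technique, "notes": notes}
    )
    BLACKBOARD.write_text(json.dumps(board, indent=2))
```

Because each spawn reloads the file before writing, state survives even when individual agent sessions are short-lived.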

The first few hours

44 challenges across 7 categories — reverse engineering (7), OSINT (5), forensics (7), cryptography (8), web (4), misc (8), and pwn (5). Commander sorted by estimated solve time and started dispatching.
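Commander's dispatch policy amounts to a shortest-estimated-time-first queue. A minimal sketch, assuming a per-challenge time estimate (the `Challenge` type and `est_minutes` field are illustrative, not the real scheduler):

```python
from dataclasses import dataclass

@dataclass
class Challenge:
    name: str
    category: str
    est_minutes: int  # Commander's difficulty estimate

def dispatch_order(challenges: list[Challenge]) -> list[Challenge]:
    """Shortest estimated solve time first, so easy points land early."""
    return sorted(challenges, key=lambda c: c.est_minutes)
```

Sorting cheap wins to the front is why the first hours were fast: the expensive reversing and pwn work was deliberately deferred.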

The first hours were fast. Web challenges fell quickly: SQL injection, insecure cookies, JWT alg: none. Crypto opened with encoding challenges that Operator cleared in minutes. Librarian was cataloguing throughout.

By hour four, the solve rate slowed. Commander was choosing more carefully, deprioritizing brute-force computation and flagging image challenges as low-probability.

The anti-cheating mechanism

We built a rule early: if a challenge was solved in under three minutes, an automatic audit ran before submitting the flag. The auditor reviewed session history and checked whether the agent had actually worked the problem.
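The gate can be sketched as a time threshold plus a session-history check. The real auditor reviewed the full session with a model call; the keyword heuristic below is a stand-in for illustration, and the marker strings are assumptions:

```python
FAST_SOLVE_SECONDS = 180  # the "under three minutes" rule

# Hypothetical markers of shortcut solves (e.g. reading the flag from
# a file instead of exploiting the service).
SHORTCUT_MARKERS = ("cat README", "cat flag")

def verdict(solve_seconds: float, session_log: list[str]) -> str:
    """Only sub-threshold solves are audited before flag submission."""
    if solve_seconds >= FAST_SOLVE_SECONDS:
        return "OK"
    if any(m in line for line in session_log for m in SHORTCUT_MARKERS):
        return "CHEATED"
    return "OK"
```

A `CHEATED` verdict blocks submission and sends the challenge back to Commander for a legitimate re-solve, which is exactly what happened on the pwn README case below.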

This caught a real case: on one pwn challenge, Operator read a README.md containing the flag rather than exploiting the service. The session was marked CHEATED and Commander was told to redo it through legitimate exploitation.

The audit also made our logs more trustworthy. Every fast solve had been verified.

The middle game: Librarian's value

Hours six through twenty were where Librarian integration showed its value most clearly.

Forensics challenges often share techniques — steganography, file carving, metadata extraction. As Librarian accumulated knowledge from solved forensics challenges, Operator's first attempts on new ones were better-calibrated. Instead of starting from first principles, Operator received briefings: "previous forensics used binwalk and foremost; JPEG steganography appeared twice."
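Librarian's briefing step is, at its core, a retrieval over blackboard findings filtered by category. A minimal sketch under that assumption (the dict shape and `briefing` name are illustrative):

```python
def briefing(findings: list[dict], category: str, limit: int = 3) -> list[dict]:
    """Return the most recent prior-solve notes in the same category."""
    relevant = [f for f in findings if f["category"] == category]
    return relevant[-limit:]  # newest entries last, capped for prompt size
```

Even this crude recency-plus-category filter captures the effect described: Operator's opening moves on a new forensics challenge start from accumulated tactics, not first principles.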

The eighth crypto challenge was solved significantly faster than the first — similar difficulty, but by then Librarian had extracted approaches to substitution ciphers, padding oracles, and XOR key recovery.

Commander also made calls we wouldn't have made manually. Around hour sixteen, it deprioritized two shellcode challenges and redirected Operator to unstarted OSINT challenges. The OSINT batch went quickly. Good call.

The four unsolved challenges

We finished 40/44. The four unsolved were all visual/image analysis tasks: a degraded QR code, object identification in photographs, and low-resolution character reading.

Not surprising in retrospect. Claude's vision capabilities aren't optimized for pixel-level analysis. Commander recognized this pattern around hour fifteen and stopped assigning image-heavy tasks, flagging them as "pending human review." No human was available.

The right fix: integrate a dedicated image analysis tool — a custom MCP server wrapping specialized vision models.

What we learned

The blackboard pattern works. A persistent JSON file as durable state, with spawn/announce for communication, is simple and effective coordination without tight coupling.

Model selection by role matters. Haiku for Librarian (high-volume, latency-sensitive). Opus for Commander (judgment calls). Sonnet for Operator (balanced depth/cost).
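The role-to-model assignment is simple enough to pin down in configuration. A sketch (the model identifiers are shorthand for the families named above, not exact API model strings):

```python
# Role → model mapping; identifiers are illustrative shorthand.
ROLES = {
    "commander": {"model": "claude-opus",   "why": "judgment calls, low volume"},
    "operator":  {"model": "claude-sonnet", "why": "balanced depth and cost"},
    "librarian": {"model": "claude-haiku",  "why": "high volume, latency-sensitive"},
}

def model_for(role: str) -> str:
    return ROLES[role]["model"]
```

Keeping the mapping declarative makes it cheap to rebalance cost against capability between events.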

Vision is the ceiling. Four of four failures required precision image analysis. This gap can't be closed by prompt engineering alone.

Unattended operation is achievable, but fragile in specific ways. 24 hours, no crashes, no loops, no obviously wrong flags. But the system didn't ask for help when it hit something it couldn't handle. When should an autonomous agent stop vs. move on? For CTF, moving on is usually right. For other domains, it might not be.

The Trinity architecture is part of the Claw-Stack research project. Full documentation: claw-stack.com/en/docs. See also our post on building persistent memory for AI agents.
