Mirage995

I wrapped Gemini Flash with memory and a swarm. It went from 9/12 to 12/12 on a bug benchmark, and the 3 it failed were brutal

I've been building SHARD for a few months: an agentic scaffold that wraps LLMs with persistent memory, multi-agent
swarms, and a nightly self-study loop. Last night I ran a full benchmark — 12 hard Python bug-fix tasks, naked Gemini Flash vs SHARD wrapping the same model.

Tasks fully solved: naked 9/12 → SHARD 12/12.

The 3 tasks the naked model couldn't close are worth examining.

The 3 tasks the naked LLM failed

T1 — html_trap (naked: 38.9%, SHARD: 100%)
HTML rendering pipeline with XSS injection via unescaped f-strings. The model kept fixing the obvious paths and
missing the edge cases. SHARD's Security reviewer flagged the exact injection vector on attempt 2.
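The bug class looks roughly like this (my own minimal illustration, not the actual benchmark task):

```python
import html

user_input = "<script>alert(1)</script>"

# The bug: interpolating untrusted input straight into an f-string
unsafe = f"<p>Hello {user_input}</p>"             # script tag passes through verbatim

# The fix: escape before interpolation
safe = f"<p>Hello {html.escape(user_input)}</p>"  # angle brackets become &lt; / &gt;
print(safe)
```

The "obvious paths" are the ones where `user_input` visibly comes from a request; the edge cases are f-strings several layers deep in the rendering pipeline.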

T10 — template_parser (naked: 20%, SHARD: 100%)
Real bug from pylint#7993 — regex .+? vs \w+? inside a template parser. Naked model passed 2/10 tests and confidently
produced wrong output. SHARD passed all 10 on attempt 1 because the GraphRAG had causal context from a prior study
session on regex semantics.
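A quick illustration of why `\w+?` matters here (a minimal example of the bug class, not the actual pylint patch):

```python
import re

line = "Hello {user}, you have { } unread messages"

# .+? lazily matches ANY characters, so the bogus "{ }" placeholder is captured too
loose = re.findall(r"\{(.+?)\}", line)
print(loose)   # ['user', ' ']

# \w+? matches only word characters, so only real identifiers survive
strict = re.findall(r"\{(\w+?)\}", line)
print(strict)  # ['user']
```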

T2 — ghost_bug (naked: 93.8%, SHARD: 100%)
Almost there: the naked model missed 1 test out of 16. SHARD closed it on attempt 3 via the swarm, where the
EdgeCases reviewer found the boundary condition the solo model skipped.


The architecture that fills the gap

```
Attempt 1:  LLM solo
Attempt 2+: Architect → Coder → [Concurrency, Security, EdgeCases, Performance, DataIntegrity]
```

GraphRAG injects: "regex .+? matches newlines — causes_conflict with line-by-line parsers"

Every reviewer runs in parallel. Their critiques are merged into a single patch prompt. If the same test stays stuck
for 2+ rounds, Focus Mode fires: all reviewers silenced, direct Architect → Coder channel only. This killed a
previously unresolvable stuck-at-15/16 loop in testing.
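In pseudocode, a review round looks something like this (a sketch with made-up names; the real orchestration lives in the repo):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative roles only; the actual reviewer prompts are in the SHARD repo.
REVIEWERS = ["Concurrency", "Security", "EdgeCases", "Performance", "DataIntegrity"]

def run_reviewer(role: str, patch: str) -> str:
    # Placeholder for an LLM call with a role-specific system prompt.
    return f"[{role}] critique of patch ({len(patch)} chars)"

def review_round(patch: str, stuck_rounds: int) -> str:
    """Merge parallel critiques into one patch prompt; Focus Mode skips reviewers."""
    if stuck_rounds >= 2:
        # Focus Mode: reviewers silenced, direct Architect -> Coder channel only.
        return "FOCUS MODE\n" + patch
    with ThreadPoolExecutor() as pool:
        critiques = list(pool.map(lambda r: run_reviewer(r, patch), REVIEWERS))
    return "Apply fixes addressing:\n" + "\n".join(critiques) + "\n---\n" + patch

print(review_round("def fix(): ...", stuck_rounds=0))
```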


What surprised me

The cases where SHARD adds zero value (tasks 3-9, 11-12) are exactly the tasks where the bug is local and syntactic.
One wrong operator, one missing None check. A single LLM call is fine for those.
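For instance, a bug of this shape needs no swarm (an illustrative example, not one of the benchmark tasks):

```python
from types import SimpleNamespace

def greet(user):
    return "Hello, " + user.name          # bug: AttributeError when user is None

def greet_fixed(user):
    if user is None:                      # the missing None check: a one-line, local fix
        return "Hello, guest"
    return "Hello, " + user.name

print(greet_fixed(SimpleNamespace(name="Ada")))  # Hello, Ada
print(greet_fixed(None))                         # Hello, guest
```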

The delta appears on structural bugs: things that require understanding the interaction between components, historical
context about why a pattern breaks, or cross-attempt reasoning. That's where stateless prompting hits its ceiling.


Try it on your own code

```shell
git clone https://github.com/Mirage995/shard-v1
cd shard-v1
python shard_challenge.py buggy.py tests.py

# With a GitHub repo as extra context (uses Repomix)
python shard_challenge.py buggy.py tests.py --repo https://github.com/you/repo
```


Full benchmark results, task definitions, and architecture doc are in the repo. Happy to dig into any of the
components.
